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;-H Abstract 

,^ In many applications, one has side information, e.g., labels that are provided in a semi- 

supervised manner, about a specific target region of a large data set, and one wants to perform 
machine learning and data analysis tasks "nearby" that prespecified target region. For example, 
one might be interested in the clustering structure of a data graph near a prespecified "seed set" 

I— I of nodes, or one might be interested in finding partitions in an image that are near a prespecified 

\^ "ground truth" set of pixels. Locally-biased problems of this sort are particularly challenging for 

^J popular eigenvector-based machine learning and data analysis tools. At root, the reason is that 

^ eigenvectors are inherently global quantities, thus limiting the applicability of eigenvector-based 

O methods in situations where one is interested in very local properties of the data. 

In this paper, we address this issue by providing a methodology to construct semi- supervised 

^-H eigenvectors of a graph Laplacian, and we illustrate how these locally-biased eigenvectors can 

^ be used to perform locally-biased machine learning. These semi-supervised eigenvectors cap- 

OO ture successive ly-orthogonalized directions of maximum variance, conditioned on being well- 

correlated with an input seed set of nodes that is assumed to be provided in a semi-supervised 
manner. We show that these semi-supervised eigenvectors can be computed quickly as the so- 
lution to a system of linear equations; and we also describe several variants of our basic method 
that have improved scaling properties. We provide several empirical examples demonstrating 

T^ how these semi-supervised eigenvectors can be used to perform locally-biased learning; and we 

discuss the relationship between our results and recent machine learning algorithms that use 
global eigenvectors of the graph Laplacian. 
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H 1 Introduction 



In many applications, one has information about a specific target region of a large data set, and 
one wants to perform common machine learning and data analysis tasks "nearby" the pre-specified 
target region. In such situations, eigenvector-based methods such as those that have been popular 
in machine learning in recent years tend to have serious difficulties. At root, the reason is that 
eigenvectors, e.g., of Laplacian matrices of data graphs, are inherently global quantities, and thus 
they might not be sensitive to very local information. Motivated by this, we consider the problem of 
finding a set of locally-biased vectors — we will call them semi- supervised eigenvectors — that inherit 
many of the "nice" properties that the leading nontrivial global eigenvectors of a graph Laplacian 
have — for example, that capture "slowly varying" modes in the data, that are fairly-efficiently 
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computable, that can be used for common machine learning and data analysis tasks such as kernel- 
based and semi-supervised learning, etc. — so that we can perform what we will call locally-biased 
machine learning in a principled manner. 

1.1 Locally-biased Learning 

By locally-biased machine learning, we mean that we have a data set, e.g., represented as a graph, 
and that we have information, e.g., given in a semi-supervised manner, that certain "regions" of the 
data graph are of particular interest. In this case, we may want to focus predominantly on those 
regions and perform data analysis and machine learning, e.g., classification, clustering, ranking, 
etc., that is "biased toward" those pre-specified regions. Examples of this include the following. 

• Locally-biased community identification. In social and information network analysis, one 
might have a small "seed set" of nodes that belong to a cluster or community of interest p|[35]: 
in this case, one might want to perform link or edge prediction, or one might want to "refine" 
the seed set in order to find other nearby members. 

• Locally-biased image segmentation. In computer vision, one might have a large corpus of 
images along with a "ground truth" set of pixels as provided by a face detection algorithm [lU 
[371 [38]; ill this case, one might want to segment entire heads from the background for all the 
images in the corpus in an automated manner. 

• Locally-biased neural connectivity analysis. In functional magnetic resonance imaging ap- 
plications, one might have small sets of neurons that "fire" in response to some external 
experimental stimulus [42]; in this case, one might want to analyze the subsequent temporal 
dynamics of stimulation of neurons that are "nearby," either in terms of connectivity topology 
or functional response, members of the original set. 

In each of these examples, the data are modeled by a graph — which is either "given" from the ap- 
plication domain or is "constructed" from feature vectors obtained from the application domain — 
and one has information that can be viewed as semi-supervised in the sense that it consists of 
exogeneously-specified "labels" for the nodes of the graph. In addition, there are typically a 
relatively-small number of labels and one is interested in obtaining insight about the data graph 
nearby those labels. 

These examples present considerable challenges for standard global spectral techniques and 
other traditional eigenvector-based methods. (Such eigenvector-based methods have received at- 
tention in a wide range of machine learning and data analysis applications in recent years. They 
have been useful, for example, in non-linear dimensionality reduction [7^T3]; in kernel-based ma- 
chine learning [53j; in Nystrom-based learning methods [65j_58j; spectral partitioning [SOj [SH [JT] , 
and so on.) At root, the reason is that eigenvectors are inherently global quantities, thus limiting 
their applicability in situations where one is interested in very local properties of the data. That is, 
very local information can be "washed out" and essentially invisible to these globally-optimal vec- 
tors. For example, a sparse cut in a graph may be poorly correlated with the second eigenvector and 
thus invisible to a method based only on eigenvector analysis. Similarly, if one has semi-supervised 
information about a specific target region in the graph, as in the above examples, one might be in- 
terested in finding clusters near this prespecified local region in a semi-supervised manner; but this 
local region might be essentially invisible to a method that uses only global eigenvectors. Finally, 



one might be interested in using kernel-based methods to find "local correlations" or to regularize 
with respect to a "local dimensionality" in the data, but this local information might be destroyed 
in the process of constructing kernels with traditional kernel-based methods. 

1.2 Semi-supervised Eigenvectors 

In this paper, we provide a methodology to construct what we will call semi- supervised eigenvectors 
of a graph Laplacian; and we illustrate how these locally-biased eigenvectors (locally-biased in the 
sense that they will be well-correlated with the input seed set of nodes or that most of their "mass" 
will be on nodes that are "near" that seed set) inherit many of the properties that make the leading 
nontrivial global eigenvectors of the graph Laplacian so useful in applications. In order to make this 
method useful, there should ideally be a "knob" that allows us to interpolate between very local and 
the usual global eigenvectors, depending on the application at hand; we should be able to use these 
vectors in common machine learning pipelines to perform common machine learning tasks; and 
the intuitions that make the leading k nontrivial global eigenvectors of the graph Laplacian useful 
should, to the extent possible, extend to the locally-biased setting. To achieve this, we will formulate 
an optimization ansatz that is a variant of the usual global spectral graph partitioning optimization 
problem that includes a natural locality constraint as well as an orthogonality constraint, and we 
will iteratively solve this problem. 

In more detail, assume that we are given as input a (possibly weighted) data graph G = (V^ E), 
an indicator vector 5 of a small "seed set" of nodes, a correlation parameter k G [0,1], and a 
positive integer k. Then, informally, we would like to construct k vectors that satisfy the following 
bicriteria: first, each of these k vectors is well-correlated with the input seed set; and second, 
those k vectors describe successively-orthogonalized directions of maximum variance, in a manner 
analogous to the leading k nontrivial global eigenvectors of the graph Laplacian. (We emphasize 
that the seed set s of nodes, the integer A:, and the correlation parameter hc are part of the input; 
and thus they should be thought of as being available in a semi-supervised manner.) Somewhat 
more formally, our main algorithm. Algorithm [l] in Section [3| returns as output k semi-supervised 
eigenvectors; each of these is the solution to an optimization problem of the form of Generalized 
LocalSpectral in Figure [l] and thus each "captures" (say) Hi/k of the correlation with the seed 
set. Our main theoretical result, described in Section [3) states that these vectors define successively- 
orthogonalized directions of maximum variance, conditioned on being /^/fc-well-correlated with an 
input seed set 5; and that each of these k semi-supervised eigenvectors can be computed quickly 
as the solution to a system of linear equations. To extend the practical applicability of this basic 
result, we will in Section [4] describe several heuristic extensions of our basic framework that will 
make it easier to apply the method of semi-supervised eigenvectors at larger size scales. These 
extensions involve using the so-called Nystrom method, computing one locally-biased eigenvector 
and iteratively "peelling off" successive components of interest, as well as performing random walks 
that are "local" in a stronger sense than our basic method considers. 

Finally, in order to illustrate how the method of semi-supervised eigenvectors performs in prac- 
tice, we also provide a detailed empirical evaluation using a wide range of both small-scale as well as 
larger-scale data. In particular, we consider two small data sets, one consisting of graphs generated 
from a popular network generation model and the other data drawn from Congressional roll call 
voting patterns, in order to illustrate the basic method; we consider graphs constructed from the 
widely-studied MNIST digit data, in order to illustrate how the method performs on a data set 
that is widely-known in the machine learning community; and we consider two larger data sets, one 



consisting of Internet graphs and the other consisting of graphs constructed from fMRI medical 
imaging, in order to illustrate how the method performs in larger-scale applications. 

1.3 Related Work 

From a technical perspective, the work most closely related to ours is the recently-developed "local 
spectral method" of Mahoney et al. [37]. The original algorithm of Mahoney et al. [37] introduced 
a methodology to construct a locally-biased version of the leading nontrivial eigenvector of a graph 
Laplacian and also showed (theoretically and empirically in a social network analysis application) 
that that the resulting vector could be used to partition a graph in a locally-biased manner. From 
this perspective, our extension incorporates a natural orthogonality constraint that successive vec- 
tors need to be orthogonal to previous vectors. Subsequent to the work of [37], [38] applied the 
algorithm of [37j to to the problem of finding locally-biased cuts in a computer vision application. 
Similar ideas have also been applied somewhat differently. For example, [2J use locally-biased ran- 
dom walks, e.^., short random walks starting from a small seed set of nodes, to find clusters and 
communities in graphs arising in Internet advertising applications; [35] used locally-biased random 
walks to characterize the local and global clustering structure of a wide range of social and infor- 
mation networks; and [29] developed the Spectral Graph Transducer, which performs transductive 
learning via spectral graph partitioning. 

The objectives in both [29j and [37j are constrained eigenvalue problems that can be solved by 
finding the smallest eigenvalue of an asymmetric generalized eigenvalue problem; but in practice this 
procedure can be highly unstable f20]. The algorithm of [29j reduces the instabilities by performing 
all calculations in a subspace spanned by the d smallest eigenvectors of the graph Laplacian; whereas 
the algorithm of [37] performs a binary search, exploiting the monotonic relationship between 
a control parameter and the corresponding Lagrange multiplier. The form of our optimization 
problem also has similarities to other work in computer vision applications: e.^., [66j and [18j find 
good conductance clusters subject to a set of linear constraints. 

In parallel, [3J and a large body of subsequent work including [d^ used (the usual global) 
eigenvectors of the graph Laplacian to perform dimensionality reduction and data representation, 
in unsupervised and semi-supervised settings j59l EU [67] . Typically, these methods construct some 
sort of local neighborhood structure around each data point, and they optimize some sort of global 
objective function to go "from local to global" [52j. In some cases, these methods can be understood 
in terms of data drawn from an hypothesized manifold [5], and more generally they have proven 
useful for denoising and learning in semi-supervised settings [H [6] . These methods are based on 
spectral graph theory [12] ; and thus many of these methods have a natural interpretation in terms 
of diffusions and kernel-based learning [53l EH |57l [lU [23] . These interpretations are important for 
the usefulness of these global eigenvector methods in a wide range of applications. As we will see, 
many (but not all) of these interpretations can be ported to the "local" setting, an observation that 
was made previously in a different context [14j. 

Many of these diffusion-based spectral methods also have a natural interpretation in terms 
of spectral ranking [61j. "Topic sensitive" and "personalized" versions of these spectral ranking 
methods have also been studied fSH^^'SBJ; and these were the motivation for diffusion-based methods 
to find locally-biased clusters in large graphs [551(11 [37]. Our optimization ansatz is a generalization 
of the linear equation formulation of the PageRank procedure [131 I3Z1 ISI] ; and its solution involves 
Laplacian-based linear equation solving, which has been suggested as a primitive is of more general 
interest in large-scale data analysis [60j. 



1.4 Outline of the Paper 

In the next section, Section [2| we will provide notation and some background and discuss related 
work. Then, in Section |3] we will present our main algorithm and our main theoretical result 
justifying the algorithm; and in Section |4] we will present several extensions of our basic method 
that will help for certain larger-scale applications of the method of semi-supervised eigenvectors. In 
Section [5j we present an empirical analysis, including both toy data to illustrate how the "knobs" of 
our method work, as well as applications to realistic machine learning and data analysis problems; 
and in Section [6] we present a brief discussion and conclustion. 

2 Background and Notation 

Let G — {V,E,w) be a connected undirected graph with n = \V\ vertices and m = \E\ edges, 
in which edge {i, j} has weight Wij. For a set of vertices S C V in a, graph, the volume of S is 

def def 

vol(S') = X^iGS'^^' i^ which case the volume of the graph G is vol(G) = vol(y) = 2m. In the 
following, Ag G R^^^ will denote the adjacency matrix of G, while Dq ^ M^^^ will denote the 
diagonal degree matrix of G, z.e., Dcii^i) — di — J2{i j}eE^ij^ ^^^ weighted degree of vertex i. 

The Laplacian of G is defined as Lq = Dq — Aq- (This is also called the combinatorial Laplacian, 
in which case the normalized Laplacian of G is Cq — Dq ' LqDq ' .) 

The Laplacian is the symmetric matrix having quadratic form x^ Lqx — J^ijeE^^ji^i ~ ^j)^' 
for X G M^. This implies that Lq is positive semidefinite and that the all-one vector 1 G MX is 
the eigenvector corresponding to the smallest eigenvalue 0. The generalized eigenvalues of Lqx = 
XiDcx are = Ai < A2 < • • • < Xn- We will use V2 to denote smallest non-trivial eigenvector, i.e., 
the eigenvector corresponding to A2; vs to denote the next eigenvector; and so on. We will overload 
notation to use X2{A) to denote the smallest non-zero generalized eigenvalue of A with respect to 
Dg- Finally, for a matrix A, let A'^ denote its (uniquely defined) Moore-Penrose pseudoinverse. For 
two vectors x, ?/ E M^, and the degree matrix Dq for a graph G, we define the degree-weighted inner 

product as x^Dcy — YM=i^iyi^i' I^ particular, if a vector x has unit norm, then x^Dqx — 1. 
Given a subset of vertices 5 C y, we denote by l^- the indicator vector of S in R^ and by 1 the 
vector in R^ having all entries set equal to 1. 

3 Optimization Approach to Semi-supervised Eigenvectors 

In this section, we provide our main technical results: a motivation and statement of our optimiza- 
tion ansatz; our main algorithm for computing semi-supervised eigenvectors; and an analysis that 
our algorithm computes solutions of our optimization ansatz. 

3.1 Motivation for the Program 

Recall the optimization perspective on how one computes the leading nontrivial global eigenvectors 
of the normalized Laplacian Cq or, equivalently, of the leading nontrivial generalized eigenvectors 
of Lg- The first nontrivial eigenvector V2 is the solution to the problem GlobalSpectral that 
is presented on the left of Figure [T} Equivalently, although GlobalSpectral is a non-convex 
optimization problem, strong duality holds for it and it's solution may be computed as V2^ the 
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Figure 1: Left: The usual GlobalSpectral partitioning optimization problem; the vector achiev- 
ing the optimal solution is i;2, the leading nontrivial generalized eigenvector of Lq with respect to 
Dq. Middle: The LocalSpectral optimization problem, which was originally introduced in [37] : 
for n — ^^ this coincides with the usual global spectral objective, while for /^ > 0, this produces 
solutions that are biased toward the seed vector s. Right: The Generalized LocalSpectral 
optimization problem we introduce that includes both the locality constraint and a more general 
orthogonality constraint. Our main algorithm for computing semi-supervised eigenvectors will it- 
eratively compute the solution to Generalized LocalSpectral for a sequence of X matrices. 
In all three cases, the optimization variable is x G M^. 

leading nontrivial generalized eigenvector of Lq- (In this case, the value of the objective is A2, and 
global spectral partitioning involves then doing a "sweep cut" over this vector and appealing to 
Cheeger's inequality.) The next eigenvector V'^ is the solution to GlobalSpectral, augmented 
with the constraint that x^ DgV2 — 0; and in general the t^^ generalized eigenvector of Lq is 
the solution to GlobalSpectral, augmented with the constraints that x^DcVi = 0, for i G 
{2, ... ,t — 1}. Clearly, this set of constraints and the constraint x^Dq^ — can be written as 
x^DgX = 0, where is a (t — l)-dimensional all-zeros vector, and where X is an n x (t — 1) 
orthogonal matrix whose i^^ column equals Vi (where i;i = 1, the all-ones vector, is the first column 
ofX). 

Also presented in Figure [l] is LocalSpectral, which includes a constraint that the solution be 
well-correlated with an input seed set. This LocalSpectral optimization problem was introduced 
in [37J, where it was shown that the solution to LocalSpectral may be interpreted as a locally- 
biased version of the second eigenvector of the Laplacianj^ In particular, although LocalSpectral 
is not convex, it's solution can be computed efficiently as the solution to a set of linear equations 
that generalize the popular Personalized PageRank procedure; in addition, by performing a sweep 
cut and appealing to a variant of Cheeger's inequality, this locally-biased eigenvector can be used 
to perform locally-biased spectral graph partitioning [37j . 

3.2 Our Main Algorithm 

We will formulate the problem of computing semi-supervised vectors in terms of a primitive op- 
timization problem of independent interest. Consider the Generalized LocalSpectral opti- 
mization problem, as shown in Figure IT} For this problem, we are given a graph G = {V, E), with 

^In [37j, the locality constraint was actually a quadratic constraint, and thus a somewhat involved analysis was 
required. In retrospect, given the form of the solution, and in light of the discussion below, it is clear that the 
quadratic part was not "real," and thus we present this simpler form of LocalSpectral here. This should make 
the connections with our Generalized LocalSpectral objective more immediate. 



associated Laplacian matrix Lq and diagonal degree matrix Dq] an indicator vector 5 of a small 
"seed set" of nodes; a correlation parameter k G [0,1]; and an n x z/ constraint matrix X that 
may be assumed to be an orthogonal matrix. We will assume (without loss of generality) that s 
is properly normalized and orthogonalized so that s^ Dqs — 1 and s^Dq^ = 0. While s can be a 
general unit vector orthogonal to 1, it may be helpful to think of s as the indicator vector of one 
or more vertices in V^ corresponding to the target region of the graph. 

In words, the problem Generalized LocalSpectral asks us to find a vector x G M^ that 
minimizes the variance x^ Lqx subject to several constraints: that x is unit length; that x is 
orthogonal to the span of X; and that x is a/^- well-correlated with the input seed set vector 
s. In our application of Generalized LocalSpectral to the computation of semi-supervised 
eigenvectors, we will iteratively compute the solution to Generalized LocalSpectral, updating 
X to contain the already-computed semi-supervised eigenvectors. That is, to compute the first semi- 
supervised eigenvector, we let X = 1, i.e., the n-dimensional all-ones vector, which is the trivial 
eigenvector L^, in which case X is an nx 1 matrix; and to compute each subsequent semi-supervised 
eigenvector, we let the columns of X consist of 1 and the other semi-supervised eigenvectors found 
in each of the previous iterations. 

To show that Generalized LocalSpectral is efficiently-solvable, note that it is a quadratic 
program with only one quadratic constraint and one linear equality constraint jj In order to remove 
the equality constraint, which will simplify the problem, let's change variables by defining the 
n X (n — u) matrix F as 

{x : X^Dgx = 0} = {x : X = Fx}. 

That is, F is a span for the null space of X^; and we will take F to be an orthogonal matrix. In 
particular, this implies that F^F is an {n — u) x [n — v) Identity and FF^ is an n x n Projection. 
Then, with respect to the x variable. Generalized LocalSpectral becomes 

minimize x F LcFy 
y 

subject to SFf^DgFx = 1, (1) 

x^F^Dgs > v^. 

In terms of the variable x, the solution to this optimization problem is of the form 

X* = cF{F^ {Lg-7Dg)F)^ F^Dgs 

= c{FF^{Lg-7Dg)FF^)^Dgs, (2) 

for a normalization constant c G (0, oc) and for some 7 that depends on ^/R. The second line 
follows from the first since F is an n x (n — z/) orthogonal matrix. This so-called "S-proceudre" is 
described in greater detail in Chapter 5 and Appendix B of [lOj. The significance of this is that, 
although it is a non-convex optimization problem, the Generalized LocalSpectral problem 
can be solved by solving a linear equation, in the form given in Eqn. ([2]). 

Returning to our problem of computing semi-supervised eigenvectors, recall that, in addition to 
the input for the Generalized LocalSpectral problem, we need to specify a positive integer 
k that indicates the number of vectors to be computed. In the simplest case, we would assume 



^Alternatively, note that it is an example of an constrained eigenvalue problem [20]. We thank the numerous 
individuals who pointed this out to us subsequent to our dissemination of the original version of this paper. 



that we would like the correlation to be "evenly distributed" across all k vectors, in which case 
we will require that each vector is y^/^- well-correlated with the input seed set vector s; but this 
assumption can easily be relaxed, and thus Algorithm [l] is formulated more generally as taking a 
fc-dimensional vector /^ = [a^i, . . . , Kk]^ of correlation coefficients as input. 

To compute the first semi-supervised eigenvector, we will let X = 1, the all-ones vector, in 
which case the first nontrivial semi-supervised eigenvector is 

xI = c{Lg-JiDg)^Dgs, (3) 

where 71 is chosen to saturate the part of the correlation constraint along the first direction. (Note 
that the projections FF^ from Eqn. ^ are not present in Eqn. ^ since by design s^DqI — 0.) 
That is, to find the correct setting of 71, it suffices to perform a binary search over the possible 
values of 71 in the interval (— vol(G), A2(G)) until the correlation constraint is satisfied, that is, 
until {s^DgXi)'^ is sufficiently close to ki. 

To compute subsequent semi-supervised eigenvectors, i.e., at steps t = 2, . . . , fc if one ultimately 
wants a total of k semi-supervised eigenvectors, then one lets X be the n x t matrix of the form 

X = [l,xl,...,xU], (4) 

where x^, . . . ,x^_i are successive semi-supervised eigenvectors; and the projection matrix FF^ is 
of the form 

FF^ = / - DgX{X^DgDgX)-^X^Dg, 

due to the the degree-weighted inner norm. 

Then, by Eqn. ([2]), the t^^ semi-supervised eigenvector takes the form 

xl = c{FF^{LG-^tDG)FF^fDGS. 



Algorithm 1 Main algorithm to compute semi-supervised eigenvectors 

Require: L^, Dg, 5, /^ = [a^i, . . . , HikV , e such that s^DgI = 0, s^Dgs = 1, n^l < 1 

1: X = [1] 

2: for t = 1 to A: do 

3: FF^ ^ / - DgX{X^DgDgX)-^X^Dg 

4: T ^ A2 where FF^LgFF^V2 = \2FF^DgFF^V2 

5: ± ^ -VOl(G) 

6: repeat 

7t ^ (^ + T)/2 (Binary search over 7^) 
xt ^ {FF^{Lg - jtDG)FF^yFF^DGS 



9 
10: 
11 
12 
13 



Normalize xt such that xj DgXi — 1 

if {xJDgs)'^ > Kf then _L ^ 7^ else T ^ 7^ end if 
until \\{xjDGs)^-^t\\ <eor ||(± + T)/2 - 7t|| <e 
Augment X with x^ by letting X = [X,x^]. 
end for 



In more detail. Algorithm [T] presents pseudo-code for our main algorithm for computing semi- 
supervised eigenvectors. The algorithm takes as input a graph G = (y, £"), a seed set s (which could 



be a general vector s G M^, subject for simplicity to the normalization constraints s^Dq^ — and 
s^ Dqs = 1, but which is most easily thought of as an indicator vector for the local "seed set" of 
nodes), a number k of vectors we want to compute, and a vector of locality parameters (/^i, . . . , Hk)^ 
where tii G [0, 1] and X^^^i i^i — ^ (where, in the simplest case, one could choose ni — njk^ \li^ for 
some K G [0, 1]). Several things should be noted about our implementation of our main algorithm. 
First, as we will discuss in more detail below, we compute the projection matrix FF^ only implicitly. 
Second, a naive approach to Eqn. ^ does not immediately lead to an efficient solution, since Dgs 
will not be in the span of {FF^{Lg — jDg)FF^)^ thus leading to a large residual. By changing 
variables so that x = FF^y, the solution becomes 

X* oc FF^{FF^{Lg - -ftDG)FF^)^FF^DGS. 

Since FF^ is a projection matrix, this expression is equivalent to 

X* oc {FF^(Lg - -itDG)FF^)^ FF^Dgs. (5) 

Third, regarding the solution x^, we exploit that FF^{Lg — jiDG)FF^ is an SPSD matrix, and 
we apply the conjugate gradient method, rather than computing the explicit pseudoinverse. That 
is, in the implementation we never explicitly represent the dense matrix FF^ , but instead we 
treat it as an operator and we simply evaluate the result of applying a vector to it on either side. 
Fourth, we use that A2 can never decrease (here we refer to A2 as the smallest non-zero eigenvalue 
of the modified matrix), so we only recalculate the upper bound for the binary search when an 
iteration saturates without satisfying \\{xfDGs)'^ — Kt\\ ^ ^- Estimating the bound is critical for 
the semi-supervised eigenvectors to be able to interpolate all the way to the global eigenvectors 



of the graph, so in Section |3.4| we return to a discussion on efficient strategies for computing the 
leading nontrivial eigenvalue of Lg projected down onto the space perpendicular to the previously 
computed solutions. 

From this discussion, it should be clear that Algorithm [l] solves the semi-supervised eigenvector 
problem by solving in an iterative manner optimization problems of the form of Generalized 
LocalSpectral; and that the running time of Algorithm [l] boils down to solving a sequence of 
linear equations. 

3.3 Discussion of Our Main Algorithm 

There is a natural "regularization" interpretation underlying our construction of semi-supervised 
eigenvectors. To see this, recall that the first step of our algorithm can be computed as the solution 
of a set of linear equations 

x* = c{LG--fDG)^DGS, (6) 

for some normalization constant c and some 7 that can be determined by a binary search over 
(— vol(G), A2(G)); and that subsequent steps compute the analogous quantity, subject to addi- 
tional constraints that the solution be orthogonal to the previously-computed vectors. The quan- 
tity {Lg — ^Dg)^ can be interpreted as a "regularized" version of the pseudoinverse of L, where 
7 G (— oc,A2(G)) serves as the regularization parameter. This interpretation has recently been 
made precise: [3Q\ show that running a PageRank computation — as well as running other diffusion- 
based procedures — exactly optimizes a regularized version of the GlobalSpectral (or Local- 
Spectral, depending on the input seed vector) problem; and [47j provide a precise statistical 
framework justifying this. 



The usual interpretation of PageRank involves "random walkers" who uniformly (or non- 
uniformly, in the case of Personalized PageRank) "teleport" with a probability a G (0,1). As 
described in [37], choosing a E (0, 1) corresponds to choosing 7 E (— oc, 0). By rearranging Eqn. ([g]) 
as 

X* = c{{DG-AG)--fDG)^DGS 

Dg - Ag Dgs 



1 — 7 \ 1 — 7 

c 



1-7 



^~g' [l - Y^AgD^}^ Dgs, 



we recognize AgDq^ as the standard random walk matrix, and it becomes immediate that the 
solution based on random walkers. 



X 



1-7 



^5^(^+E(r^^5^^«)V««' 



is divergent for 7 > 0. Since 7 = A2(G) corresponds to no regularization and 7 -^ — oc corresponds 
to heavy regularization, viewing this problem in terms of solving a linear equation is formally more 
powerful than viewing it in terms of random walkers. That is, while all possible values of the reg- 
ularization parameter — and in particular the (positive) value A2(-) — are achievable algorithmically 
by solving a linear equation, only values in (— oo,0) are achievable by running a PageRank diffu- 
sion. In particular, if the optimal value of 7 that saturates the A^-dependent locality constraint is 
negative, then running the PageRank diffusion could find it; otherwise, the "best" one could do will 
still not saturate the locality constraint, in which case some of the intended correlation is "unused." 

An important technical and practical point has to do with the precise manner in which the 
i^^ vector is well-correlated with the seed set s. In our formulation, this is captured by a locality 
parameter 7^ that is chosen (via a binary search) to "saturate" the correlation condition, z.e., so that 
the i^^ vector is A^/fc-well-correlated with the input seed set. As a general rule, successive 7^s need to 
be chosen that successive vectors are less well-localized around the input seed set. (Alternatively, 
depending on the application, one could choose this parameter so that successive 7iS are equal; but 
this will involve "sacrificing" some amount of the ti/k correlation, which will lead to the last or 
last few eigenvectors being very poorly-correlated with the input seed set. These tradeoffs will be 
described in more detail below.) Informally, if 5 is a seed set consisting of a small number of nodes 
that are "nearby" each other, then to maintain a given amount of correlation, we must "view" the 
graph over larger and larger size scales as we compute more and more semi-supervised eigenvectors. 
More formally, we need to let the value of the regularization parameter 7 at the i^^ round, we call it 
7i, vary for each i E {1, . . . , A:}. That is, 7^ is not pre-specified, but it is chosen via a binary search 
over the region (— vol(G), A2(-)), where A2(-) is the leading nontrivial eigenvalue of Lg projected 
down onto the space perpendicular to the previously-computed vectors (which is in general larger 
than A2(G)). In this sense, our semi-supervised eigenvectors are both "locally-biased", in a manner 
that depends on the input seed vector and correlation parameter, and "regularized", in a manner 
that depends on the local graph structure. 

To illustrate the previous discussion. Figure [2] considers the two-dimensional grid. In each 
subfigure, the blue curve shows the correlation with a single seed node as a function of 7 for the 
leading semi-supervised eigenvector, and the black dot illustrates the value of 7 for three different 
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(a) {xjDcsf = 0.3 



(b) (xJDgs)^ = 0.2 



(c) {xjDcsf = 0.1 



Figure 2: Interplay between the 7 parameter and the correlation k that a semi-supervised eigenvec- 
tor has with a seed s on a two-dimensional grid. In Figure [2(a)]|2 (c) \ we vary the locality parameter 
for the leading semi-supervised eigenvector, which in each case leads to a value of 7 which is marked 
by the black dot on the blue curve. This allows us to illustrate the influence on the relationship 



between 7 and Hi on the next semi-supervised eigenvector. Figure 2(a) also highlights the range 
(7 < 0) in which Personalized PageRank can be used for computing solutions to semi-supervised 
eigenvectors. 



values of the locality parameter k. This relationship between k and 7 is in general non-convex, 
but it is monotonic for 7 G (— vol(G), A2(G)). The red curve in each subfigure shows the decay for 
the second semi-supervised eigenvector. Recall that it is perpendicular to the first semi-supervised 
eigenvector, that the decay is monotonic for 7 G (— vol(G), A2(G)), and that A2 < A2 < A3. In 



Figure 2(a), the first semi-supervised eigenvector is not "too" close to A2, and so A2 (i.e., the 
second eigenvalue of the next semi-supervised eigenvector) increases just slightly. In Figure |2(b)| 
we consider a locality parameter that leads to a value of 7 that is closer to A2, thereby increasing the 
value of A2. Finally, in Figure 2(c)[ the locality parameter is such that the leading semi-supervised 
eigenvector almost coincides with V2; this results in A2 ^ A3, as required if we were to compute the 
global eigenvectors. 



3.4 Bounding the Binary Search 

For the following derivations it is more convenient to consider the normalized graph Laplacian, in 
which case we define the first solution as 



yi = c{Cg -111)' 



Dfs 



(7) 



-1/2, 



where x\ = Dq yi. This approach is convenient since the projection operator with null space 



defined by previous solutions can be expressed as FF'^ 
is, Y is of the form 

Y = [Dy^ 



'G 



,yi, 



I - YY^ , assuming that Y^Y ^1. That 



Vt-ili 



where y* are successive solutions to Eqn. (171). In the following the type of projection operator will 
be implicit from the context, i.e., when working with the combinatorial graph Laplacian FF'^ — 
I — DqX {X^ DgDgX)~^ X^ Dq, whereas for the normalized graph Laplacian FF'^ = I — YY^ . 
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For the normalized graph Laplacian Cg-^ the eigenvalues of Hqv — Xv equal the eigenvalues of 
the generalized eigenvalue problem Lqv = XDgv. The binary search employed in Algorithm [l] uses 
a monotonic relationship between the 7 G (— vol(G), A2(-)) parameter and the correlation with the 
seed x^ Dqs^ that can be deduced from the KKT-conditions [37j. Note, that if the upper bound 
for the binary search T = \2{FF^ CgFF^) is not determined with sufficient precision, the search 
will (if we underestimate T) fail to satisfy the constraint, or (if we overestimate T) fail to converge 
because the monotonic relationship no longer hold. 

By Lemma[l]in AppendixJAJit follows that \2{FF^CgFF^) = \2{Cg^ojYY^) when uj ^ 00. 
Since the latter term is a sum of two PSD matrices, the value of the upper bound can only increase as 
stated by Lemma [2] in Appendix [Aj This is an important property, because if we do not recalculate 
T, the previous value is guaranteed to be an underestimate, meaning that the objective will remain 
convex. Thus, it may be more efficient to first recompute T when the binary search fails to satisfy 
(x^Dgs)^ = K, meaning that T must be recomputed to increase the search range. 

We compute the value for the upper bound of the binary search by transforming the problem in 
such a way that we can determine the greatest eigenvalue of a new system (fast and robust), and 
from that, deduce the new value of T = \2{FF^ CqFF^). We do so by expanding the expression 
as 

FF^CqFF^ = FF^ (l - D'^^'^AgD'^^'^) FF^ 



FpT _ fF^dJ''^AgDJ''^FF^ 

FF^D~^'^AgD~^'^FF^ + YY^ 

Since all columns of Y will be eigenvectors of FF^CgFF^ with zero eigenvalue, these will all be 
eigenvectors of FF^Dq ^ AgDq ^ FF^ + YY^ with eigenvalue 1. Hence, the largest algebraic 
eigenvalue \i^x{FF^ Dq ^ AgDq ^ FF^) can be used to compute the upper bound for the binary 
search as 

T = X2{FF^CgFF^) = 1 - \i^a{FF^D-'/^AgD-'/^FF^). (8) 

The reason for not considering the largest magnitude eigenvalue, is that Ag may be indefinite. 
Finally, with respect to our implementation we emphasize that FF^ is used as a projection operator, 
and not represented explicitly. 

4 Extension of Our Main Algorithm and Implementation Details 

In this section, we present two variants of our main algorithm that are more well-suited for very 
large-scale applications; the first uses a column-based low-rank approximation, and the second 



uses random walk ideas. In Section 4.1, we describe how to use the Nystrom method, which 
constructs a low-rank approximation to the kernel matrix by sampling columns, to construct a 
general solution for semi-supervised eigenvectors, where the low-rankness is exploited for very 



efficient computation. Then, in Section |4.2| we describe a "Push-peeling heuristic," based on 
the efficient Push algorithm by [Ij. The basic idea is that if, rather than iteratively computing 
locally-biased semi-supervised eigenvectors using the procedure described in Algorithmjl] we instead 
compute solutions to LocalSpectral and then construct the semi-supervised eigenvectors by 
"projecting away" pieces of these solutions, then we can take advantage of local random walks that 
have improved algorithmic properties. 
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4.1 A Nystrom-based Low-rank Approach 

Here we describe the use of the recently-popular Nystrom method to speed up the computation 
of semi-supervised eigenvectors. We do so by considering how a low-rank decomposition can be 
exploited to yield solutions to the Generalized LocalSpectral objective in Figure [TJ where 
the running time largely depends on a matrix- vector product. These methods are most appropriate 
when the kernel matrix is reasonably well-approximated by a low-rank matrix [151 ESI [6l] . 

Given some low-rank approximation Cq ^ I — VAV^ , we apply the Woodbury matrix identity, 
and we derive an explicit solution for the leading semi-supervised eigenvector 

yi^c{{l--f)I-VAV^)^Dl/^s 

where T^a = i_^ In order to compute efficiently the subsequent semi-supervised eigenvectors we 

must accommodate for the projection operator FF^ — I — YY^ , while yet exploiting the explicit 
closed-form inverse (Co — ")I)^ ^ y^ (^ ^ VTy^^. However, the projection operator complicates 
the expression, since the previous solution can be spanned by multiple global eigenvectors, so 
leveraging from the low-rank decomposition is more difficult for the inverse {F F^ (Co — "fl) F F^)^ . 
Conveniently, we can decouple the projection operator by treating the orthogonality constraint 
using a Lagrangian approach, such that the solution can be expressed as 

yt = c{Cg-iI + ujYY^y D^fs, 

where u > denotes the associated Lagrange multiplier, and where the sign is deduced from the 
KKT conditions. Applying the Woodbury matrix identity is now straightforward 

(P^ + (jjYY^)^ = P^+ - ujP^^Y (/ + ujY^P^^Y)^ Y^P^^, 

where for notational convenience we have introduced P^ = Co — ")F By decomposing Y^ P^^Y 
with an eigendecomposition USU^ the equation simplifies as follows 

(P^ + uYY^)"" = P^+ - cjP^+y (/ + ujUSU^)^ Y^P^^ 

= P^+ - p^^Yunu^Y^p^^, 

where flu = i ^ . Note how this result gives a well defined way of controlling the amount 

i^^ ' 1-1. .1—1 

of "orthogonality" , and by Lemma [1] in Appendix |Aj we get exact orthogonality in the limit of 

cj ^ oc, in which case the expression simplifies to 



(p^ + (jjYY^)^ = p^+ - p^+y(y^p^+y)+y^p^+. 



Using the explicit expression for Py^, the solution now only involves matrix- vector products and 
the inverse of a small matrix 

yt = c (P^+ - P,+Y{Y^P,+Y)+Y^P^+) ofs. (9) 
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To conclude this section, let us also consider how we can optimize the efficiency of the calculation 
of \2{FF^CgFF^) used for bounding the binary search in Algorithm [ij According to Eqn. ([s]) 



the bound can be calculated efficiently as T = 1 — \i^x{FF^Dq ' AqDq ' FF^). However, by 
substituting with Dq ^ AqDq ^ '^ VKV^ ^ we can exploit low-rankness since 

T = 1 - Xi^a{FF^VAV^FF^) = 1 - Ala(A^/V^FF^1/A1/2)^ 
where the latter is a much smaller system. 

4.2 A Push-peeling Heuristic 

Here we present a variant of our main algorithm that exploits the connections between diffusion- 
based procedures and eigenvectors, allowing semi-supervised eigenvectors to be efficiently computed 
for large networks. This is most well-known for the leading nontrivial eigenvectors of the graph 
Laplacian |12| ; but recent work has exploited these connections in the context of performing locally- 
biased spectral graph partitioning [55l[Tl|32|. In particular, we can compute the locally-biased vector 
using the first step of Algorithm [iJ or alternatively we can compute it using a locally-biased random 
walk of the form used in |55( [J . Here we present a heuristic that works by peeling off components 
from a solution to the PageRank problem, and by exploiting the regularization interpretation of 7, 
we can from these components obtain the subsequent semi-supervised eigenvectors. 

Specifically, we focus on the Push algorithm by [Ij. This algorithm approximates the solution to 
PageRank very efficiently, by exploiting the local modifications that occur when the seed is highly 
concentrated. This makes our algorithm very scalable and applicable for large-scale data, since 
only the local neighborhood near the seed set will be touched by the algorithm. In comparison, by 
solving the linear system of equations we explicitly touch all nodes in the graph, even though most 
spectral rankings will be below the computational precision [8]. 

We adapt a similar notation as in [1] and start by defining the usual PageRank vector pr(a, 5pr) 
as the unique solution of the linear system 

pr(a, 5pr) = Q^v + (1 - a)AGD^^pT{a, 5pr), (10) 

where a is the teleportation parameter, and Spj- is the sparse starting vector. For comparison, the 
push algorithm by [Ij computes an approximate PageRank vector pT^{a\ Spj-) for a slightly different 
system 

pT^(a\ 5pr) = a^Spr + (1 - a^)WpT^{a\ 5pr), 



where VF = |(/ + AgDq^) and not the usual random walk matrix ADq^ as used in Eqn. (10). 
However, these equations are only superficially different, and equivalent up to a change of the 
respective teleportation parameter. Thus, it is straightforward to verify that these teleportation 
parameters and the 7 parameter of Eqn. ([g]) are related as 

2a' , a I 1 

a — 7 <^ a — <^ a 



l^a' 2- a 7-2' 

and that the leading semi-supervised eigenvector for 7 G (— oc,0) can be approximated as 



xl^^^D-^\r,[-L-,Dasy 
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To generalize subsequent semi-supervised eigenvectors to this diffusion based framework, we need 
to accommodate for the projection operator such that subsequent solutions can be expressed in 
terms of graph diffusions. By requiring distinct values of 7 for all semi-supervised eigenvectors, we 
may use the solution for the leading semi-supervised eigenvector and then systematically "peel off" 
components, thereby obtaining the solution of one of the consecutive semi-supervised eigenvectors. 
By Lemma [5J in Appendix |A] the general solution in Eqn. ^ can be approximated by 

xt ^c{l- XX^Dg) {Lg - 7tDG)+DGS, (11) 

under the assumption that all 7^^^^ for 1 < fc < t are sufficiently apart. If we think about 7/^ as 
being distinct eigenvalues of the generalized eigenvalue problem LcXk = ^kDcXk, then it is clear 
that Eqn. ( pTj ), correctly computes the sequence of generalized eigenvectors. This is explained 
by the fact that (Lq — ^tDG)^DGS can be interpreted as the first step of the Rayleigh quotient 
iteration, where 7^ is the estimate of the eigenvalue, and Dqs is the estimate of the eigenvector. 
Given that the estimate of the eigenvalue is right, this algorithm will in the initial step compute 
the corresponding eigenvector, and the operator (/ — XX^Dg) will be superfluous, as the global 
eigenvectors are already orthogonal in the degree-weighted norm. To quantify the failure modes of 
the approximation, let us consider what happens when 72 starts to approach 71. What constitutes 
the second solution for a particular value of 72 is the perpendicular component with respect to the 
projection onto the solution given by 71. As 72 approaches 71, this perpendicular part diminishes 
and the solution becomes ill-posed. Fortunately, we can easily detect such issues during the binary 
search in Algorithm [l} and in general the approximation has turned out to work very well in practice 
as our experimental results in Section [5] show. 

In terms of the approximate PageRank vector pT^(a\spj:) , the general approximate solution 
takes the following form 

x; « c (/ - XX^Dg) D^'pr, (^;^, Dgs^ . (12) 



As already stated in Section [3^ the impact of using a diffusion based procedure is that we cannot 
interpolate all the way to the global eigenvectors, and that the main challenge is that the solutions 
do not become too localized. The e parameter of the Push algorithm controls the threshold for 
propagating mass away from the seed set and into the adjacent nodes in the graph. If the threshold 
is too high, the solution will be very localized and make it difficult to find more than a few semi- 
supervised eigenvectors, as characterized by Lemma |3] in Appendix |A| because the leading ones will 
then span the entire space of the seed set. As the choice of e is important for the applicability of 
our algorithm, we will in Section [5] investigate the influence of this parameter on large data graphs. 
To conclude this section, we consider an important implementation detail that have been omit- 
ted so far. In the work of [37] the seed vector was defined to be perpendicular to the all-ones vector, 
and for the sake of consistency we have chosen to define it in the same way. The impact of pro- 
jecting the seed set to a space that is perpendicular to the all-ones vector is that the resulting seed 
vector is no longer sparse, making the use of the Push algorithm in Eqn. ( [T2| ) inefficient. The seed 
vector can, however, without loss of generality, be defined as 5 oc Dq ' [I — '^o'^o^) <so where sq is 
the sparse seed, and v^ ex diag ( Dq j is the leading eigenvector of the normalized graph Laplacian 
(corresponding to the all-ones vector of the combinatorial graph Laplacian). If we substitute with 
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this expression for the seed in Eqn. (12), it follows by plain algebra (see Appendix [B]) that 



xt ^c{l- XX^Dg) (^D^'pr, (^-^, d'J^o^ - D^'^^qv^sq^ . (13) 

Now the Push algorithm is only defined on the sparse seed set making the the expression very 
scalable. Finally, the Push algorithm maintains a queue of high residual nodes that are yet to 
be processed. The order in which nodes are processed influences the overall running time, and in 
[8] preliminary experiments showed that a FIFO queue resulted in the best performance for large 
values of 7, as compared to a priority queue that scales logarithmically. For this reason we have 
chosen to use a FIFO queue in our implementation. 

5 Empirical Results 

In this section, we provide a detailed empirical evaluation of the method of semi-supervised eigen- 
vectors and how it can be used for locally-biased machine learning. Our goal is two- fold: flrst, 
to illustrate how the "knobs" of the method work; and second, to illustrate the usefulness of the 
method in real applications. To do this, we consider several classes of data. 



• Toy data. In Section 5.1, we consider one-dimensional examples of the popular "small 
world" model [62j. This is a parameterized family of models that interpolates between low- 
dimensional grids and random graphs; and, as such, it allows us to illustrate the behavior of 
the method and its various parameters in a controlled setting. 



• 



Congressional voting data. In Section 5.2, we consider roll call voting data from the 



United States Congress that are based on [49]. This is an example of reahstic data set that 
has relatively-simple global structure but nontrivial local structure that varies with time [14] : 
and thus it allows us to illustrate the method in a realistic but relatively-clean setting. 



Handwritten image data. In Section |5.3[ we consider data from the MNIST digit data 
set [34j. These data have been widely-studied in machine learning and related areas and 
they have substantial "local heterogeneity." Thus, these data allow us to illustrate how the 
method may be used to perform locally-biased versions of common machine learning tasks 
such as smoothing, clustering, and kernel construction. 



• 



Functional brain imaging data. In Section 5.4, we consider functional magnetic resonance 



imaging (fMRI) data. Single subject fMRI data is characterized by high dimensionality and 
relatively few samples, in contrast to the MNIST data that consist of many samples and a 
relatively low dimensionality. We demonstrate how our semi-supervised eigenvectors can be 
applied to construct a data-driven spatially-biased basis by incorporating a priori knowledge 
from a functional brain atlas [17j. 



Large-scale network data. In Section 5.5, we consider large-scale network data, and 



demonstrate signiflcant performance improvements of the push-peeling heuristic compared 
to solving the same equations using a conjugate gradient solver. These improvements are 
demonstrated on datasets from the DIMACS implementation challenge, as well as on large 
web-crawls with more then 3 billion non-zeros in the adjacency matrix 
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5.1 Small- world Data 

The first data sets we consider are networks constructed from the so-called small-world model. This 
model can be used to demonstrate how semi-supervised eigenvectors focus on specific target regions 
of a large data graph to capture slowest modes of local variation; and it can also be used to illustrate 
how the "knobs" of the method work, e.g., how hc and 7 interplay, in a practical setting. In Figure[3| 
we plot the usual global eigenvectors, as well as locally-biased semi-supervised eigenvectors, around 
illustrations of non-rewired and rewired realizations of the small- world graph, i.e., for different 
values of the rewiring probability p and for different values of the locality parameter k. 

To start, in Figure [3(a)| that we show a graph with no randomly-rewired edges {p = 0) and 
a parameter k such that the global eigenvectors are obtained. This yields a symmetric graph 
with eigenvectors corresponding to orthogonal sinusoids, i.e., the first two capture the slowest 
mode of variation and correspond to a sine and cosine with equal random phase-shift (up to a 
rotational ambiguity). In Figure [3(b)[ random edges have been added with probability p = 0.01 
and the parameter k. is still chosen such that the global eigenvectors — now of the rewired graph — 
are obtained. Note the many small kinks in the eigenvectors at the location of the randomly added 
edges. Note also the slow mode of variation in the interval on the top left; a normalized-cut based 
on the leading global eigenvector would extract this region, since the remainder of the ring is more 
well-connected due to the random rewiring. 

In Figure 3(c)[ we see the same graph realization as in Figure |3(b)[ except that the semi- 



supervised eigenvectors have a seed node at the top of the circle, i.e., at "12 o-clock," and the 
locality parameter Hit = 0.005, which corresponds to moderately well-localized eigenvectors. As 
with the global eigenvectors, the locally-biased semi-supervised eigenvectors are of successively- 
increasing (but still localized) variation. Note also that the neighborhood around "11 o-clock" 
contains more mass, e.g., when compared with the same parts of the circle in Figure [3(b) | or with 
other parts of the circle in Figure [S^cJ , even though it is not very near the seed node in the original 



graph geometry. The reason for this is that this region is well-connected with the seed via a 
randomly added edge, and thus it is close in the modified graph topology. Above this visualization, 
we also show the value of 7^ that saturates /^t? i-^-^ It is the Lagrange multiplier that defines the 
effective locality /^^ Not shown is that if we kept reducing nt, then 7^ would tend towards A^+i, 
and the respective semi-supervised eigenvectors would tend towards the global eigenvectors that 
are illustrated in Figure [3'(b)[ Finally, in Figure [3'(d)[ the desired locality is increased to /^ = 0.05 
(which has the effect of decreasing the value of 7^), making the semi-supervised eigenvectors more 
localized in the neighborhood of the seed. It should be clear that, in addition to being determined 
by the locality parameter, we can think of 7 as a regularizer biasing the global eigenvectors towards 
the region near the seed set. That is, variation in eigenvectors that are near the initial seed (in the 
modified graph topology) are most important, while variation that is far away from the initial seed 
matters much less. 

5.2 Congressional Voting Data 

The next data set we consider is a network constructed from a time series of roll call voting patterns 
from the United States Congress that are based on [49j . This is a particularly well-structured social 
network for which there is a great deal of meta-information, and it has been studied recently with 
graph-based methods |30l [631 El- Thus, it permits a good illustration of the method of semi- 
supervised eigenvectors in a real application [l8j. This data set is known to have nontrivial time- 



17 



A2 = 0.000011, A3 = 0.000011, 

A4 = 0.(DW}'™'«'«, As = 0.000046. 



p = om, 

A2 = 0.000149, A3 = 0.000274, 

A4 = § JTO31S, A5 = 0.000489. 




(a) Global eigenvectors (p = 0) 

p = 0.01, K = 0.005, 
71 = 0.000047, 72 = 0.000052, 

73 = =0 JTOOOO, 74 = -0.000000. 




(b) Global eigenvectors (p = 0.01) 

p = 0.01, K^O.OS, 
71 = -0.004367, yo -0.001778, 
I, 74 = -0.000822. 





(c) Semi-supervised eigenvectors 



(d) Semi-supervised eigenvectors 



Figure 3: Illustration of small- world graphs with rewiring probability oi p = or^p = 0.01 and 
with different values of the /^ parameter. For each subfigure, the data consist of 3600 nodes, each 
connected to it's 8 nearest-neighbors. In the center of each subfigure, we show the nodes ( ) and 
edges (black and light gray are the local edges, and are the randomly-rewired edges). We wrap 

around the plots (black x-axis and gray background), visualizing the 4 smallest semi-supervised 
eigenvectors. Eigenvectors are color coded as blue, red, jallow, and green, starting with the one 
having the smallest eigenvalue. 
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varying structure at different time steps, and we will illustrate how the method of semi-supervised 
eigenvectors can perform locally-biased classification with a traditional kernel-based algorithm. 

In more detail, we evaluate our method by considering the known Congress data-set containing 
the roll call voting patterns in the U.S Senate across time. We considered Senates in the 70^^ 
Congress through the 110*^ Congress, thus covering the years 1927 to 2008. During this time, the 
U.S went from 48 to 50 states, hence the number of senators in each of these 41 Congresses was 
roughly the same. We constructed an A^ x A^ adjacency matrix, with N = 4196 (41 Congresses 
each with ^ 100 Senators) where Aij G [0, 1] represents the extent of voting agreement between 
legislators i and j, and where identical senators in adjacent Congresses are connected with an inter- 
Congress connection strength. We then considered the Laplacian matrix of this graph, constructed 
in the usual way [14j . 
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Figure 4: Shows the Congress adjacency matrix, along with four of the individual Congresses. Nodes 
are scaled according to their degree, blue nodes correspond to Democrats, red to Republicans, and 
green to Independents. 

Figure |4] visualizes the adjacency matrix, along with four of the individual Congresses, color 
coded by party. This illustrates that these data should be viewed — informally — as a structure 
(depending on the specific voting patterns of each Congress) evolving along a one-dimensional 
temporal axis, confirming the results of [Hj. Note that the latter two Congresses are significantly 
better described by a simple two-clustering than the former two Congresses, and an examination 
of the clustering properties of each of the 40 Congresses reveals significant variation in the local 
structure of individual Congresses, in a manner broadly consistent with [48j and [49j. In particular, 
the more recent Congresses are significantly more polarized. 

The first vertical column of Figure [5] illustrates the first three global eigenvectors of the full data 
set, illustrating fluctuations that are sinusoidal and consistent with the one-dimensional temporal 
scaffolding. Also shown in the first column are the values of that eigenfunction for the members 
of the 99^^ Congress, illustrating that there is not a good separation based on party affiliations. 
The next three vertical columns of Figure [5] illustrate various localized eigenvectors computed by 
starting with a seed node in the 99^^ Congress. For the second column, we visualize the semi- 
supervised eigenvectors for a very low correlation (hc = 0.001), which corresponds to only a weak 
localization — in this case one sees eigenvectors that look very similar to the global eigenvectors, and 
the elements of the eigenvector on that Congress do not reveal partitions based on the party cuts. 

The third and fourth column of Figure [5] illustrate the semi-supervised eigenvectors for a much 
higher correlation {hc = 0.1), meaning a much stronger amount of locality. In particular, the third 
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Figure 5: First column: The leading three nontrivial global eigenvectors. Second column: The 
leading three semi-supervised eigenvectors seeded (circled node) in an articulation point between 
the two parties in the 99^^ Congress (see Figure [4|), for correlation /^ = 0.001. Third column: 
Same seed as previous column, but for a correlation of a^ = 0.1. Notice the localization on the 
third semi-supervised eigenvector. Fourth column: Same correlation as the previous column, but 
for another seed node well within the cluster of Republicans. Notice the localization on all three 
semi-supervised eigenvectors. 

column starts with the seed node marked A in Figure [4| which is at the articulation point between 
the two parties, while the fourth column starts with the seed node marked S, which is located 
well within the cluster of Republicans. In both cases the eigenvectors are much more strongly 
localized on the 99^^ Congress near the seed node, and in both cases one observes the partition 
into two parties based on the elements of the localized eigenvectors. Note, however, that when the 
initial seed is at the articulation point between two parties then the situation is much noisier: in 
this case, this "partitionability" is seen only on the third semi-supervised eigenvector, while when 
the initial seed is well within one party then this is seen on all three eigenvectors. Intuitively, 
when the seed set is strongly within a good cluster, then that cluster tends to be found with semi- 
supervised eigenvectors (and we will observe this again below). This is consistent with the diffusion 
interpretation of eigenvectors. This is also consistent with [M], who observed that the properties 
of eigenvector localization depended on the local structure of the data around the seed node, as 
well as the larger scale structure around that local cluster. 
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Figure 6: Classification accuracy measured in individual Congresses. For each Congress we perform 
5-fold cross validation based on ^ 80 samples and leave out the remaining 20 samples to estimate 
an unbiased test error. Error bars are obtained by resampling and they correspond to 1 standard 
deviation. For each approach we consider features based on the 1^^ (blue), 2^^ (green), and 3^^ 
(red) smallest eigenvector(s), excluding the all-one vector. We also plot the probability of the most 
probable class as a baseline measure (black) as some Congresses are very imbalanced. 

To illustrate how these structural properties manifest themselves in a more traditional machine 
learning task, we also consider the classification task of discriminating between Democrats and 
Republicans in single Congresses, i.e., we measure to what extent we can extract local discriminative 
features. To do so, we apply L2-regularized L2-I0SS support vector classification with a linear 
kernel, where features are extracted using the global eigenvectors of the entire data set, global 
eigenvectors from a single Congress (best case measure), and our semi-supervised eigenvectors. 
Figure [g] illustrates the classification accuracy for 1, 2, and 3 eigenvectors. As reported by [T4j, 
locations that exhibit discriminative information are best found on low-order eigenvectors of this 
data, explaining why the classifier based global eigenvectors performs poorly. In the classifier 
based on global eigenvectors in the single Congress we exploit a priori knowledge to extract the 
relevant data, that in a usual situation would be impossible. Hence, this is simply to define 
a baseline point of reference for the best case classification accuracy. The classifier based on 
semi-supervised eigenvectors is seeded using a few training samples and performs in-between the 
two other approaches. Compared to our point of reference. Congresses in the range 88 to 96 do 
worse with the semi-supervised eigenvectors; whereas for Congresses after 100 the semi-supervised 
approach almost performs on par, even for a single single eigenvector. This is consistent with the 
visualization in Figure |4] illustrating that earlier Congresses are less cleanly separable, as well as 
with empirical evidence indicating heterogeneity due to Southern Democrats in earlier Congresses 
and the recent increase in party polarization in more recent Congresses, as described in [48j and [49] . 

5.3 MNIST Digit Data 

The next data set we consider is the well-studied MNIST data set containing 60, 000 training 
digits and 10, 000 test digits ranging from to 9; and, with these data, we demonstrate the use 
of semi-supervised eigenvectors as a feature extraction preprocessing step in a traditional machine 
learning setting. We construct the full 70, 000 x 70, 000 fc-NN graph, with A: = 10 and with edge 
weights given by Wij = exp( — ^||x^ — Xj|p), where af is the Euclidian distance of the i^^ node 
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to it's nearest neighbor; and from this we define the graph Laplacian in the usual way. We then 
evaluate the semi-supervised eigenvectors in a transductive learning setting by disregarding the 
majority of labels in the entire training data. We use a few samples from each class to seed our 
semi-supervised eigenvectors as well as a few others to train a downstream classification algorithm. 
For this evaluation, we use the Spectral Graph Transducer (SGT) of [29j; and we choose to use it 
for two main reasons. First, the transductive classifier is inherently designed to work on a subset 
of global eigenvectors of the graph Laplacian, making it ideal for validating that the localized 
basis constructed by the semi-supervised eigenvectors can be more informative when we are solely 
interested in the "local heterogeneity" near a seed set. Second, using the SGT based on global 
eigenvectors is a good point of comparison, because we are only interested in the effect of our 
subspace representation. (If we used one type of classifier in the local setting, and another in the 
global, the classification accuracy that we measure would obviously be confounded.) As in [29], we 
normalize the spectrum of both global and semi-supervised eigenvectors by replacing the eigenvalues 
with some monotonically increasing function. We use A^ = p, ^-c, focusing on ranking among 
smallest cuts; see [Hj. Furthermore, we fix the regularization parameter of the SGT to c = 3200, 
and for simplicity we fix 7 = for all semi-supervised eigenvectors, implicitly defining the effective 
n — [ni^ . . . ^KkY . Clearly, other correlation distributions n and other values of 7 parameter may 
yield subspaces with even better discriminative properties (which is an issue to which we will return 



in Section 5.3.2 in greater detail). 
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Table 1: Classification error for discriminating between 4s and 9s for the SGT based on, respectively, 
semi-supervised eigenvectors and global eigenvectors. The first column from the left encodes the 
configuration, e.^., 1:10 interprets as 1 seed and 10 training samples from each class (total of 22 
samples — for the global approach these are all used for training) . When the seed is well-determined 
and the number of training samples moderate (50:500), then a single semi-supervised eigenvector 
is sufficient; whereas for less data, we benefit from using multiple semi-supervised eigenvectors. All 
experiments have been repeated 10 times. 



5.3.1 Discriminating between pairs of digits 

Here, we consider the task of discriminating between two digits; and, in order to address a partic- 
ularly challenging task, we work with 45 and 95. (This is particularly challenging since these two 
classes tend to overlap more than other combinations since, e.^., a closed 4 can resemble a 9 more 
than an open 4.) Hence, we expect that the class separation axis will not be evident in the leading 
global eigenvector, but instead it will be "buried" further down the spectrum; and we hope to find 
a "locally-biased class separation axis" with locally-biased semi-supervised eigenvectors. Thus, this 
example will illustrate how semi-supervised eigenvectors can represent relevant heterogeneities in 
a local subspace of low dimensionality. See Table [l} which summarizes our classification results 
based on, respectively, semi-supervised eigenvectors and global eigenvectors, when we use the SGT. 
See also Figure [7| and Figure [8J which illustrate two realizations for the 1:10 configuration. In 
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these two figures, the training samples are fixed; and, to demonstrate the influence of the seed, 
we have varied the seed nodes. In particular, in Figure [7| the seed nodes 5+ and s_ are located 
well- within the respective classes; while in Figure [8| they are located much closer to the boudary 
between the two classes. As intuitively expected, when the seed nodes fall well within the classes 
to be differentiated, the classification is much better than when the seed nodes are located closer 
to the boundary between the two classes. See the caption in these figures for further details. 
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Figure 7: Discrimination between 4s and 9s. Left: Shows a subset of the classification results for 
the SGT based on 5 semi-supervised eigenvectors seeded in 5+ and 5_, and trained using samples 
/+ and /_. Misclassifications are marked with black frames. Right: Visualizes all test data spanned 
by the first 5 semi-supervised eigenvectors, by plotting each component as a function of the others. 
Red (blue) points correspond to 4 (9), whereas green points correspond to remaining digits. As the 
seed nodes are good representatives, we note that the eigenvectors provide a good class separation. 
We also plot the error as a function of local dimensionality, as well as the unexplained correlation, 
i.e., initial components explain the majority of the correlation with the seed (effect of 7 = 0). The 
particular realization based on the leading 5 semi-supervised eigenvectors yields an error of ^ 0.03 
(dashed circle). 



5.3.2 Effect of choosing the k correlation/locality parameter 

Here, we discuss the effect of the choice of the correlation/locality parameter n at different steps 
of Algorithm hi e.g.^ how {Hit}t=i should be distributed among the k components. For example, 
will the downstream classifier benefit the most from a uniform distribution or will there exist some 
other nonuniform distribution that is better? Although this will be highly problem specific, one 
might hope that in realistic applications the classification performance is not too sensitive to the 
actual choice of distribution. To investigate the effect in our example of discriminating between 
4s and 9s, we consider 3 semi-supervised eigenvectors for various hc distributions. Our results are 
summarized in Figure |9J 

Figures r9(a)[ |9(bJ , and |9(c)| show, for the global eigenvectors and for semi-supervised eigenvec- 
tors, where the k vector has been chosed to be very nonuniform and very uniform, the top three 
(global or semi-supervised) eigenvectors plotted against each other as well as the ROC curve for 
the SGT classifier discriminating between 4s and 9s; and Figure |9(d)| shows the test error as the 
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Figure 8: Discrimination between 4s and 9s. See the general description in Figure [7| Here we 
illustrate an instance where the 5+ shares many similarities with 5_, i.e., 5+ is on the boundary 
of the two classes. This particular realization achieves a classification error of ^ 0.30 (dashed 
circle). In this constellation we first discover localization on low order semi-supervised eigenvectors 
{^ 12 eigenvectors), which is comparable to the error based on global eigenvectors (see Tablejl]), i.e., 
further down the spectrum we recover from the bad seed and pickup the relevant mode of variation. 



Hi vector is varied over the unit simplex. In more detail, red (respectively, blue) corresponds to 4s 
(respectively, 9s), and green points are the remaining digits; and, for Figures [9(b) and |9(c){ the 
semi-supervised eigenvectors are seeded using 50 samples from each target class (4s vs. 9s) and 
having a non- uniform distribution of a^, as specified. As seen from the visualization of the semi- 
supervised eigenvectors in Figures |9(b)| and |9(c)| the classes are much better separated than by 
using the global eigenvectors, which are shown in Figure [9(a)} For example, this is supported by the 
Area Under the Curve (AUG) and Error Rate (ERR), being the point on the Receiver Operating 
Characteristic (ROC) curve that corresponds to having an equal probability of miss-classifying a 
positive or negative sample, which is a fair estimate as the classes in the MNIST data set is fairly 
balanced. For Figure [9(c) , where we use a uniform distribution of k, the classifier performs slightly 
better than in Figure 9(b)[ which uses the non-uniform Hi distribution (but both semi-supervised 
approaches are significantly better than the using the global eigenvectors). For Figure [9(d) [ we 
see the test error on the simplex defined by Hi. To obtain this plot we sampled 500 different Hi 
distributions according to a uniform Dirichlet distribution. With the exception of one extreme 
very nonuiform corner, the classification accuracy is not too sensitive to the choice of k distribu- 
tion. Thus, if we think of the semi-supervised eigenvectors as a locally-regularized version of the 
global eigenvectors, the desired discriminative properties are not too sensitive to the details of the 
locally-biased regularization. 



5.3.3 Effect of approximately computing semi-supervised eigenvectors 

Here, we discuss of the push-peeling procedure from Section [4] that is designed to compute efficient 
approximations to the semi-supervised eigenvectors by using local random walks to compute an 



approximation to personalized PageRank vectors. Consider Figure [TOj which shows results for two 
values of the e parameter (i.e., the parameter in the push algorithm that implicitly determines 
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Figure 9: The effect of varying the correlation/locahty parameter /^ on the classification accuracy. 
9(a), |9(b) , |9(c) show the top three (global or semi-supervised) eigenvectors plotted against each 
other as well as the ROC curve for the SGT classifier discriminating between 4s and 9s; and |9(d)| 
shows the test error as the n vector is varied over the unit simplex. 



how many nodes will be touched). Again we construct the full 70, 000 x 70, 000 fc-NN graph, with 
A: = 10 and with edge weights given by wij — exp(— ^||x^ — Xj|p), where af is the Euclidian 

i 

distance of the i^^ node to it's nearest neighbor; and from this we define the graph Laplacian in 
the usual way. Using this representation we compute 3 semi-supervised eigenvectors seeding using 
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50 samples from each class (4s vs. 9s). However, in this case, we fix the regularization parameter 
vector as 7 = [—0.0150, —0.0093, — vol(G)]; and note that choosing these specific values correspond 
to the solutions visualized in Figure [9(c)] when the equations are solved exactly. Figure 10(a) shows 



the results for e = 0.001. This approximation gives us sparse solutions, and the histogram in the 
second row illustrates the digits that are assigned a nonzero value in the respective semi-supervised 
eigenvector. In particular, note that most of the mass of the eigenvector is distributed on 4s and 
9s; but, for this choose of e, only few digits of interest (^ 2.8243%, meaning, in particular, that 
not all of the 4s and 9s) have been touched by the algorithm. This results in the lack of a clean 
separation between the two classes as one sweeps along the leading semi-supervised eigenvector, as 
illustrated in the first row; the very uniform correlation distribution n — [0.8789,0.0118,0.1093]; 
and the high classification error, as shown in the ROC curve in the bottom panel. 

Consider, next. Figure [TO(b)| which shows the results for e = 0.0001, ie., where the locality 
parameter e has been reduced by an order of magnitude. In this case, the algorithm reproduces 
the solution by touching only ^ 25.177% of the nodes in the graph, i.e., basically all of the 4s and 
9s and only a few other digits. This leads to a much cleaner separation between the two classes as 
one sweeps over the leading semi-supervised eigenvector; a much more uniform distribution over /^; 
and a classification accuracy that is much better and is similar to what we saw in Figure [9(c)| This 
example illustrates that this push-peeling approximation provides a principled manner to generalize 
the concept of semi-supervised eigenvectors to large-scale settings, where it will be infeasible to 
touch all nodes of the graph. 

5.3.4 Effect of low-rank Nystrom approximation 

Here we discuss the use of the low-rank Nystrom approximation which is commonly used in large- 
scale kernel-based machine learning. The memory requirements for representing the explicit kernel 
matrix, that we here take to be our graph, scales with (9(A^^), whereas inverting the matrix scales 
with (9(A^^), which, in large-scale settings, is infeasible. The Nystrom technique subsamples the 
dataset to approximate the kernel matrix, and the memory requirements scales with 0{nN) and 
runs in 0{n?N), where n is size of the subsample. For completeness we include the derivation of 
the Nystrom approximation for the normalized graph Laplacian in Appendix [C] 



In the beginning of Section [5^ we constructed the 70,000 x 70,000 fc-nearest neighbor graph, 
with A: = 10 and with edge weights given by Wij = exp( — ^||x^ — Xj|p). Such a sparse construction 

i 

reduces the effect of "hubs" , as well as being fairly insensitive to the choice of kernel parameter, as 
the 10 nearest neighbors are likely to be very close in the Euclidian norm. Because the Nystrom 
method will approximate the dense kernel matrix, the choice of kernel parameter is more important, 
so in the following we will consider the interplay between this parameter, as well as the rank 
parameter n of the Nystrom approximation. Moreover, to allow us to compare a rank-n Nystrom 
approximation with the full rank-A^ kernel matrix, we choose to subsample the dataset for all of the 
following experiments, due to the 0{N'^) memory requirements. Thus, to provide a baseline. Figure 



11 shows results based on a fc-nearest neighbor graph constructed from 5% and 10% percent of the 
training data, where in both cases we used 10% for the test data. For both cases, when compared 
with the results of Figure 9(c)| the classification quality is degraded, and so we emphasize that the 



goal of the following results are not to outperform the results reported in Figure 9(c)| but to be 
comparable with this baseline. 



In light of this baseline. Figure 12 provides a thorough analysis for the choices of a that we used. 



Figures [l 2 (a) I and |1 2(b) I show the classification error when using the global eigenvectors, for various 
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(a) Locality parameter e = 0.001 



(b) Locality parameter e = 0.0001 



Figure 10: Illustration of the push-peeling procedure to compute 3 semi-supervised eigenvectors 
for 7 = [-0.0150, -0.0093, -vol(G)]. p^(a)] shows results for e = 0.001; and [Tn(b)] shows results for 
e = 0.0001. First row shows the entries in the leading semi-supervised eigenvector corresponding 
to test points, color-coded and sorted according to magnitude; second row shows the distribution 
of digits touched in the full graph when executing the push algorithm; and bottom panels provide 
visualizations similar to the ones in Figure M (and shown above these is the correlation vector k 
obtained for the fixed choice of 7. 
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Figure 11: Example of the impact of subsampling the data set down to 5% (in|ll(a)| ) and 10% (in 
ll(b)| ) of the original size. Remaining parameters are the same as in Figure 9(c)[ which shows the 
result to which these two plots should be compared. 



rank approximations based on the Nystrom method as well as the exact method (corresponding to 
rank = n). Interestingly, these two plots are very dissimilar in terms of their behavior as a function 
of the number of components. In particular, the plot in Figure |12(b)| shows that the low rank 
approximations for a given set of components outperform the high rank approximations, and the 
exact representation fails to reduce the error beyond 0.4 for any of the considered set of components. 
This may seem counterintuitive, but the reason for this type of behavior is that the relevant global 
eigenvectors, for a? = 200, are located far from the end of the spectrum (if we visualized more 
components for rank = n the classification error would eventually drop). For the same reason, the 
low rank approximations improve more rapidly than the high rank approximations, as the latter 
approximate the lower part of the spectrum better, and these turn out to have poor discriminative 



properties. In contrast, the results shown in Figure 12(a) provide good class separation in the lower 



part of the spectrum, resulting in the high rank approximations to reduce the error most rapidly. 

Finally, Figures p. 2(c) and 12(d) [ show the classification error for the SGT trained using the semi- 
supervised eigenvectors. (Note that the scale of the x-axis is much smaller in these subfigures.) For 
both kernel widths (in both Figures 12(c) and |12(d)] ), the ordering of the approximations are similar, 
i.e., the semi-supervised eigenvectors constructed from the rank = n approximation performs the 
best. Moreover, the gap between the rank = 400 and rank = n is largest for a^ = 200, again 
suggesting this approximation is of insufficient rank to model the relevant local heterogeneities 
deep down in the spectrum; whereas for a? = 80, the rank = 400 the approximation comes very 
close to the exact representation, suggesting that local structures are well modeled near the end of 
the spectrum. 

To summarize these results, the method of semi-supervised eigenvectors successfully extracts 
relevant local structures to perform locally-biased classification, even when they are located far 
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from the end of the spectrum. Moreover, in both cases we considered, the classification error is 
reduced significantly by using only a few locally-biased components. This contrasts with the global 
eigenvectors, where for erf — 80 at least 20 eigenvectors are needed in order to obtain similar 
performance; and for a? = 200, the classification error remains high even for 200 eigenvectors in 
case of rank = n. 
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Figure 12: We consider 10% of the MNIST training and test data and investigate the classification 
accuracy of a downstream SGT classifier for various approximations of the dense similarity matrix. 
12(a)| and |12(b)[ Classification error for the SGT evaluated directly on global eigenvectors, based on 
various Nystrom approximations and the two choices of the kernel width parameter (respectively, 
af = 80 and af — 200). 12(c) and 12(d): Classification error we have used the Nystrom approxi- 



mations as basis for computing semi-supervised eigenvectors that are then used in the downstream 
SGT classifier. All plots show the mean over 30 repetitions. 
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5.3.5 Implementation issues and running time considerations 

Here, we discuss implementation details and investigate the advantage of using the Graphics Pro- 
cessing Unit (GPU) for computing semi-supervised eigenvectors. Although the computations under- 
lying the construction of semi-supervised eigenvectors could be performed in many computational 
environments, the GPU architecture fits well with the dense semi-supervised eigenvector computa- 
tion in Eqn. Q; for each component, this expression will be executed (9(log2((A2(G) + vol(G)))/e) 
times within the binary search of Algorithm [Tj 

Compared to a Central Processing Unit (CPU), which is well-suited for processing code with a 
complex control flow, a GPU is much better suited for addressing problems that can be expressed 
as data-parallel computations with a high arithmetic intensity [33l[9l[25]. A GPU consists of a set 
of Multi Processors (MPs), each containing multiple Scalar Processors (SPs), as well as different 
types of local memories that the SPs may access. All MPs have also access to a large global memory 
that, compared to their internal memories, is much slower to access. A computation task to be 
executed on such a device is usually setup in a grid, where each element in the grid gets assigned to 
a thread. The grid is then decomposed into blocks that are scheduled onto the MPs with available 
resources, and the assigned MP will schedule the elements of the block onto its SPs in warps with 32 
threads. The best performance is obtained when all threads in a warp execute the same instruction 
and when the total number of threads in the grid is large, as this allows various latencies to be 
overlapped with arithmetic operations. 

We compare most recent CPU and GPU devices in computing the solution to Eqn. Q. In 
terms of the CPUs we test both consumer devices (GeForce) and professional devices (Tesla), where 
the latter provides enhanced performance for double-precision floating point arithmetic. For a fair 
comparison, we decided to rely on the BLAqj and CUBLAS implementations as used in Matlab 
2012b, i.e., avoiding to favor specific architecturally dependent implementation optimizations. 



since BLAS and CUBLAS should be optimal in terms of the underlying architecture. Figure [13 
shows performance measures (wall-clock-time as a function of the rank parameter) of CPU and 
GPU experiments. For single precision arithmetic the GTX 680 scales very well, and it ends up 
being more than three times faster than the 17-3820, as well as noticeably faster than the previous 
generation high-end Tesla C2070, and it even outperforms the latest generation Tesla K20c, as 



seen in Figure 13 (a) [ As seen in Figure p"3(a)| and 13(b)[ the CPUs perform poorly in the low-rank 



regime, and this is explained by the overhead of transferring data back and forth from the main 
memory and to the device. However, for the high-rank matrices the arithmetic intensity increases 
and the overhead is less dominant. Also evident is the performance improvement of the latest CPU 
generation (17-3820), that for the considered operation ends up being more than twice as fast as a 
previous generation E5620, that primarily is due to the higher clock frequency. For double precision 
arithmetic, the GTX 680 and GTX 590 are due to memory constraints stopped prematurely in the 
experiments, as they respectively are equipped with 2048MB and 1536MB (per GPU). Note that 
even though the GTX 590 is a dual GPU card, it is from the GPU computing perspective setup 
as two individual devices, and only one of these are used for the experiments. Interestingly the 
older GTX 590 outperforms the recent GTX 680, which may be explained by a higher memory 
bandwidth. In Figure |13(d)| the Tesla K20c outperforms all other devices by a fair margin, being 
^1.5 times as fast as the Tesla C2070, and four times faster than the 17-3820. 

Using GPU computing we are able to reduce the computation time considerably. Depending 



"^The BLAS implementation uses all physical CPU cores. 
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Figure 13: Running time performance measurements for solving Eqn. ([o]), given a specific value 
of the parameter 7, on the entire MNIST data set consisting of 70, 000 samples, as a function of 



the rank parameter. Single and double precision arithmetic results are respectively shown in |13(a) 
and 13(b) for the task of computing the 25*^ solution, z.e., constrained to be perpendicular to the 



previous 24 solutions. Similar does |13(c)| and |13(d)| show performance results for computing the 
500*^ solution, and here the advantage of using recent GPU architectures become even more evident, 
as the operation is dominated by a high arithmetic intensity that fit well with such architectures. 



on the application of the semi-supervised eigenvectors, the advantage may be significant, for exam- 
ple if applied in time critical applications such as online and real-time applications or large-scale 
simulations. 

5.4 Functional Magnetic Resonance Imaging Data 

The next dataset we consider is from functional Magnetic Resonance Imaging (fMRI). Here data 
analysis usually considers the characterization of relations between cognitive variables and indi- 
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vidual brain voxels, for instance using the mass-univariate General Linear Model (GLM), where 
statistical parametric maps are used to identify regions of gray matter that are significantly related 
to particular effects under study [19j. Even though such a voxel- wise univariate approach has been 
tremendously productive, there are obvious limits on what can be learned about cognitive states 
by only examining isolated voxels [42j. Multivariate methods have therefore paved the way for 
more advanced paradigms involving complex cognition, where the latent brain state cannot solely 
be determined from looking at individual voxel time series |16t [301 [?]• However, an immediate 
challenge for multivariate approaches is that weak signals carried by a sparse set of voxels can be 
very hard to detect, and for this reason multivariate approaches are often accompanied by spatial 
priors, to improve on the signal-to- noise ratio (SNR). 

Searchlight is an algorithm that scans through the whole brain by running multiple multivariate 
region-of-interest (ROI) analyses, measuring the respective generalization performance, and out- 
puts a brain map showing which regions exhibit the best discriminative properties, for example 
measured by classification accuracy for a particular subject task [32j. This approach was for in- 
stance applied by [27j, who used it to find regions in the brain that are predictive with respect to 
human intentions. Compared to a univariate approach, searchlight takes advantage of the power of 
multivariate techniques, with the caveat that it only performs well if the target signal is available 
within the area covered by the ROI. This limitation is indeed shared by the univariate approaches, 
but with searchlight we have the freedom to increase or decrease the ROI, depending on the struc- 
ture of the considered problem. If the ROI is small we approach a univariate analysis, whereas if 
the ROI is large, the information localization becomes less specific. Thus, if the multivariate signal 
is spatially distributed the searchlight approach will fall short, and simply increasing the ROI may 
not be a solution as irrelevant time series will decrease the SNR. 

The semi-supervised eigenvectors can be used to construct a spatially-guided basis that natu- 
rally allows for spatially distributed signal representations. This strategy shares many similarities 
with there searchlight approach, but it is not tied to a particular ROI, and it can span distributed 
voxel time series that are similar in terms of our graph representation. Using the semi-supervised 
eigenvectors on the voxel x voxel similarity graph in this way will yield a low dimensional repre- 
sentation that we can project the fMRI voxel time series onto, and in that projected space we can 
apply any suitable classification algorithm. 

We tested the method on Blood Oxygenation Level Dependent (BOLD) sensitive fMRI data 
acquired on a 3 Tesla MR scanner (Siemens Magnetom Verio). Additional sequence parameter were 
as follows: 25 interleaved echo planar imaging gradient echo slices, echo time 30 ms, repetition time 
1390 ms, flip angle 90 degrees. During the scanning session (1300 volumes) the subject was engaged 
in a simple motor paradigm in which the subject was asked to respond with key-presses when a 
visual cue was presented, and the classification task is to detect such key-presses. Pre-processing 
steps included: rigid body realignment, spatial smoothing (6 mm full width at half maximum 
isotropic Gaussian kernel), and high-pass filtering (cut-off frequency 1/128 Hz). See [56j for more 
details. 

We construct a voxel x voxel 10-nearest neighbor graph using the nonlinear affinity Wij = 



exp( — ||z^ — 2:j|| ). Figure 14 shows the 4 leading non-trivial global eigenvectors projected onto 
a sliced brain. Note that the first slice (top left) in such an image corresponds to the bottom 
of the brain, whereas the last slice (bottom right) corresponds to the top of the brain. The 
non-trivial global eigenvectors aim to span the most dominant sources of variation in the data, 
which in this particular dataset appears to stem mainly from the primary visual cortex (VI), and 
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(a) Global eigenvector, V2 




(c) Global eigenvector, V4 




(b) Global eigenvector, V3 




(d) Global eigenvector, v^ 



Figure 14: Visualization of the leading 4 nontrivial global eigenvectors. 

a frontal/posterior contrast apparent in the second global eigenvector. Importantly, the global 
eigenvectors are typically not associated with the interesting features of the task but rather general 
signal variation, which may be due to visual presentation of the stimuli (visual cortex) and often 
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(a) Primary Motor Cortex (PMC) 
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(b) Primary Auditory Cortex (PAC) 
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(c) Classification accuracy 



Figure 15: Figure [15 (a) | shows the seed region in PMC, and Figure 15(b)| shows the seed region 
in PAC. The plot in Figure 15(c)| shows the classification accuracy for the 5 different features 
extraction approaches. The dashed lines mark the reference where all voxel time series, as covered 
by the seed, are used in the downstream classifier, and the solid ones correspond to the accuracy 
obtained from projecting the data onto the semi-supervised eigenvectors seeded in PAC and PMC, 
as well as the global eigenvectors. 



physiological noise sources typically dominant in the lower slices of the brain near large arteries. 

Using a probabilistic functional atlas created by averaging across multiple subjects [T7], we 
carry out two experiments based on semi-supervised eigenvectors. Specifically, we construct semi- 
supervised eigenvectors seeded in Primary Motor Cortex (PMC), known to be highly involved in 
the subject task [21], as well as semi-supervised eigenvectors seeded in Primary Auditory Cortex 
(PAC), that is not expected to carry much signal with respect to our target variable [39j. The seed 
regions are highlighted in Figure |15(a)| and |15(b)[ 
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(b) Semi-supervised eigenvectors (PMC), X2 
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(c) Semi-supervised eigenvectors (PMC), xs 
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(d) Semi-supervised eigenvectors (PMC), X4 



Figure 16: Visualization of the leading 4 semi-supervised eigenvectors seeded in PMC, each corre- 



lating 0.25 with the seed, that is visualized in Figure 15(a) 



Figure [16] and [T7| shows respectively the leading 4 semi-supervised eigenvectors, each having a 
correlation of 0.25 with the seed, and respectively seeded in PMC and PAC. As expected the semi- 
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(c) Semi-supervised eigenvectors (PAC), xs 




(b) Semi-supervised eigenvectors (PAC), X2 
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Figure 17: Visualization of the leading 4 semi-supervised eigenvectors seeded in PAC, each corre- 
lating 0.25 with the seed, that is visuahzed in Figure 15(b)[ 



supervised eigenvectors are dominant near the seed region but are able to spread to related regions 
which carry information about important signal variation. For the PAC seed the first eigenvector 



36 



appears to capture the general pattern of signal variation in part of the cortex that focus on auditory 
processing. The remaining three eigenvectors appear to span specific signal variations in the PAC 
that are more specific to subregions with the auditory cortex. 

Likewise the first semi-supervised eigenvector from the seed in the PMC reveals other dominant 
parts of the motor network including the remaining parts of the PMC (posterior part of Brodmann 
area 4), somatosensory cortex (Brodmann areas 1,2 and 3) and the premotor cortex (Brodmann 
area 5). The remaining semi-supervised eigenvectors again focus on more localized sources of signal 
within these areas as well as signal variation in the primary visual cortex (Brodmann area 17), which 
is to be expected as the visual presentation of stimuli is related to motor function in the present 
task. 

For comparison in our classification task, we consider the leading global eigenvectors of the 
graph Laplacian, as well as simply extracting the time series as specified by the seed regions. For 
all of the considered feature extraction approaches we use either the projected or extracted time 
series as data for a linear SVM that is responsible for the downstream classification task. Figure 



15(c) summarizes the classification accuracies obtained by performing leave-one-out cross validation 
as a function of the number of components. For each semi-supervised eigenvector we fix a^ = | 
where k is the number of components. Hence, for two components, each correlates 0.5 with the 
seed, and so forth. In the same plot, the dashed blue line corresponds to classifying the brain state 
using only voxel time series in the region as defined by PAC. Unsurprisingly, for the dashed green 
line, corresponding to PMC, it is evident that the primary motor cortex is a much better proxy 
for predicting motor responses. Due to inter-subject variability there is no guarantee that the rigid 
body realignment will align the seed perfectly with the physical region, which explains why the 
data-driven global eigenvectors are able to yield an even higher accuracy than the PAC time series. 
Also seen is the "bump" in classification accuracy for the global eigenvectors, when we reach 4-5 
components. Thus, for this particular dataset, relevant parts of the are signal are captured in this 
regime. 

In the regime of few semi-supervised eigenvectors, the solutions are too localized to explain 
relevant local heterogeneities both near and within the seed set. As we increase the number of 
components they become less localized, and the semi-supervised eigenvectors seeded in PMC even- 
tually surpasses the accuracy of global approach. As we consider more and more components, 
while distributing the correlation evenly across the semi-supervised eigenvectors, they will eventu- 
ally converge to the global eigenvectors. Complementary, in the limit of a single component, the 
projection onto the leading trivial global eigenvector will simply correspond to the average time 
series, whereas for a leading semi-supervised eigenvector the solution is simply the seed itself, i.e., 
the projection onto this corresponds to a weighted average in the region defined by the seed. Hence, 
as seen in Figure 15(c)| there exists a regime in which the semi-supervised approach performs better 



as we are able to pickup the relevant local heterogeneities at that particular scale, given that the 
seed is relevant with respect to the subject task. 

5.5 Large-scale Network Data 

The final datasets we consider are from a collection of large sparse networks [44^^ |45l [46j. On these 



data, we demonstrate that the Push-peeling Heuristic introduced in Section |4.2| is attractive due 
to an improved running time, as compared to solving a system of linear equations. Moreover, we 
also show that the ability to obtain multiple semi-supervised eigenvectors depends on the degree 
heterogeneity near the seed. Finally, we empirically evaluate the influence of the e parameter of 
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the Push algorithm that imphcitly determines how many nodes the algorithm will touch. This 
parameter can be interpreted as a regularization parameter (different from 7 parameter), and 
setting it too large means we fail to distribute mass in the network, so that a few semi-supervised 
eigenvectors will consume all of the correlation. In particular, this behavior was investigated on 



the MNIST digits in Section [5. 3. 3[ The basic properties for the networks considered in this section 
are shown in Table O 

We start by considering the moderately sized networks from the DIMACS implementation 
challenge, as these networks are commonly used for the purpose of measuring realistic algorithm 
performance. Figure [18] shows analysis results for 6 networks from this collection, where we evaluate 
the performance and feasibility of the Push algorithm for approximating the leading semi-supervised 
eigenvector. 



Network name Number of nodes Number of edges 

DIMACS10/de2010 24,115 116,056 

DIMACS10/ct2010 67,578 336,352 

DIMACS10/il2010 451,554 2,164,464 

DIMACSlO/smallworld 100,000 999,996 

DIMACS10/333SP 3,712,815 22,217,266 

DIMACS10/AS365 3,799,275 22,736,152 

LAW/arabic-2005 22,744,080 1,107,806,146 

LAW/indochina-2004 7,414,866 301,969,638 

LAW/it-2004 41,291,594 2,054,949,894 

LAW/sk-2005 50,636,154 3,620,126,660 

LAW/uk-2002 18,520,486 523,574,516 

LAW/uk-2005 39,459,925 1,566,054,250 

Table 2: Summary of the networks considered in this section. Some of these networks are directed 

and have been symmetrized for the purpose of this analysis, i.e., the number edges in this table 
refer to the number of edges in the undirected graph. 



As stated in Section |3.3| diffusion based procedures such as the Push algorithm can be used 
to solve our objective for 7 < 0. The impact of the reduced search range is that such procedures 
may not be able to produce a uniform correlation distribution for a set of semi-supervised eigen- 
vectors. Hence, the leading solution(s) will instead pickup too much correlation, because sufficient 
mass cannot to diffuse away from the seed set. However, the effect of a non- uniform correlation 



distribution was analyzed on the MNIST data in Section 5.3, where we found that the performance 
of a downstream classifier is fairly robust to such non-uniformities, as seen by the simplex in Figure 
[9j Consequently, we emphasize that in a large-scale setting such side effects of diffusion based 
procedures is offset by the advantage of a greatly improved time complexity as compared to solving 
the system of linear equations, that implicitly touch every node. 



For each of the 6 analyzed networks in Figure 18, we run two experiments considering different 



seeds, using respectively a high degree and low degree single seed node. Figure 1 8 ( a) p8(c)] considers 
census block networks characterized by heavy-tailed degree distributions, whereas Figure |18(d)f 
|18(f)| considers more densely connected synthetic networks. For each of these 6 networks the 
speedup is measured by comparing with a standard conjugate gradient implementation using a 
tolerance of le-6, and we stress that this tolerance cannot be directly compared with e in the Push 
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Figure 18: For each network the first row depicts how the correlation decays as a tends towards 
0, whereas the bottom row shows the speedup relative to the standard approach using conjugate 
gradient with a tolerance of le-6, that is the default approach in our software distribution. Besides 
the three considered values of e the correlation plots also illustrate the decay based on conjugate 
gradient (black curve), however this may be difficult to see, as the Push algorithm for e = le-4 
coincides with that solution. Finally, seeds based on a high degree and low degree node are presented 
in respectively the first and last column, and the degree distribution for the network is visualized 
in a minor overlapping plot. 



algorithm. Moreover, we test three different settings of the e parameter, and we emphasize that 
for e = le-4, the Push algorithm produces a similar result as the conjugate gradient algorithm. In 
Figure [Is] this can be seen by the red curve (e = le-4) in the correlation decay plots (see the figure 
caption) being on top of the black curve (conjugate gradient). 

Common for Figure 18(a)p8(cy are that low degree seed nodes yield very localized solutions for 
the entire range of a, opposed to the high degree nodes that all succeed in gradually reducing the 
correlation when a is reduced. Also, the choice of e is obviously very important, ie., choosing it 
too large results in a solution that correlates too much with the seed, whereas choosing it too small 
means that we will be touching more nodes than necessary, resulting in a performance penalty. In 
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general the networks analyzed in Figure 18(a)p8(c)' are too small to yield significant performance 



improvements over the conjugate gradient algorithm, and the Push algorithm is only competitive 
for large values of a. 

For the network in Figure |18(d)[ we see similar performance characteristics as the networks 



analyzed in Figure 18(a)p8(c)| due to its small size. However, the two final networks analyzed 
in Figure |18(e)||18(f)| share similar characteristics in terms of the degree distribution, but due to 
a much larger size they show significant performance improvements over the conjugate gradient 
algorithm. Interestingly, the Push algorithm instantiated with e = le-4 yields a greater speedup 
in some settings, which may be explained by faster convergence, caused by a reduced threshold 
for distributing mass. Hence, the running time of the Push algorithm may not always decrease 
monotonically as e increases. 

In general it seems that seeding in a sparsely connected region of a network results in a solution 
having a large correlation with the seed for most values of a. This is obviously a limiting factor 
if we are interested in using the peeling procedure to find multiple semi-supervised eigenvectors in 
that particular region. However, for large networks and more densely connected regions the benefit 
of the Push algorithm is immediate. 

Finally, we scale up to demonstrate that we can adapt the notion of semi-supervised eigenvectors 
to large datasets, and we do so by analyzing 6 large web-crawl networks. These networks are large 
enough that touching all nodes is infeasible, i.e., conjugate gradient is not a feasible option, so in 



Figure 19 we resort to absolute timings. For the analysis results shown in Figure 19, we are solely 
interested in giving the reader some intuition about the running time in a large-scale setting, as 
well as an idea on how the parameters interplay. Hence, we only consider experiments where we 
seed in a high degree node, as these are likely yield the worst running times, but also succeed in 



reducing the correlation the most. This will make the peeling procedure described in Section 4.2 
applicable, allowing us to obtain multiple semi-supervised eigenvectors. As seen for all networks 
analyzed in Figure 19(a)p9(f)] the solution is highly sensitive to the choice of e, but for all networks 



we are able to reduce the correlation when a tends towards in case of e = le-6. We emphasize 
that the reason for e being smaller for these experiments, as compared to the previous is that the 
seed is normalized to have unit norm, implicitly requiring a lower e when the network increases in 
size. 

For diffusion based procedures to be useful with respect to the computation of semi-supervised 
eigenvectors, mass must be able "bleed" away from the seed set and into the surrounding network. 
Otherwise only few semi-supervised eigenvectors can be found as the leading solution(s) become 
too correlated with the seed set. For moderately sized problems conjugate gradient performs very 
well, but in a large-scale setting, as considered here, the presented approach proves very efficient, 
allowing us to compute approximations to semi-supervised eigenvectors in networks consuming 
more than 30GB of working memory. Obtaining an improved understanding of how the method of 
semi-supervised eigenvectors can be used to perform common machine learning tasks on graphs of 
that size is an obvious direction raised by our work. 

6 Conclusion 

We have introduced the concept of semi-supervised eigenvectors as local analogues of the global 
eigenvectors of a graph Laplacian that have proven so useful in a wide range of machine learning 
and data analysis applications. These vectors are biased toward prespecified local regions of interest 
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Figure 19: Visualizes results for applying the Push algorithm to 6 very large web-crawl networks. 
For all networks we seed in the node with the highest degree. The top plot in each subfigure shows 
the correlation decay as a function of a, whereas in the bottom plot we resort to absolute timings 
as the conjugate gradient algorithm is not feasible in this setting, as opposed to showing speedups 



as in Figure 18 



in a large data graph; and we have shown that since they inherit many of the nice properties of the 
usual global eigenvectors, except in a locally-biased context, they can be used to perform locally- 
biased machine learning. The basic method is conceptually simple and involves solving a sequence 
of linear equation problems; we have also presented several extensions of the basic method that 
have improved scaling properties; and we have illustrated the behavior of the method. Due to the 
speed, simplicity, stability, and intuitive appeal of the method, as well as the range of applications in 
which local regions of a large data set are of interest, we expect that the method of semi-supervised 
eigenvectors can prove useful in a wide range of machine learning and data analysis applications. 
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A Supplementary Proofs 

Lemma 1 Given an SPSD matrix M and some vector x where x~^ x — 1, it holds that 

lim (m + ujxx~^\ " ( (^ ~ ^^^) ^ (^ - ^^^) ) • (14) 

Proof: Prior to applying the pseudo inverse, x is clearly an eigenvector with eigenvalue A = on 
the right hand side, and for left hand side x is an eigenvector with eigenvalue A = oc. Hence, 
without loss of generalizability we can decompose M — axx^ + X^AXj, where A is a diagonal 
matrix, such that M+ = a^^^ + X±A'^X]^. First we consider the expansion of the left hand side 



of Eqn. (14) 



lim ({a + cj) xx~^ + X^KXjY = lim —^xx~^ + X^A+X| = X^A+X|. 
Similar, by expanding the right hand side we get 

/ - xx~^\ (axx'^ + X^AXj) (l - xx'^)Y = (x^AXJY = X^A+X7. 



Lemma 2 For M' = M ^ uj^i ^ixj where u > it holds that \k{M') > \k{M). 
Proof: All eigenvalues of the sum of rank-1 perturbations are non- negative 

bj ^ XixJ hO^M'yM. 



Lemma 3 Given an orthonormal basis, X — [xi, . . . , Xn-i], i.e.^ X^ DqX — I , and unit length seed 
8^ Dqs — 1. Then, any unit length vector x^DcXn = I, perpendicular to the subspace X^ DcXn = 
0; will have a correlation with the seed bounded by 

n-l 



< {xlDcsf < 1 - J2(^JDgs) 



i=l 



Proof: The proof follows directly from the Pythagorean theorem. Let X = [xi, . . . ^xn] be the 
orthonormal basis of R^, i.e., spanning s. Then 



N 



J2ixjDGsf = {s'^Dosf = 1. 



i=l 
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Lemma 4 For the matrix P^ — Cq — ")I it holds that 

P+-P+ = {j-^)P+P+, (15) 

given that neither 7 nor 7 coincides with an eigenvalue of Co- 

Proof: The proof follows directly by plain algebra. Simply substitute the SVD P^ — yA^y^, where 
A-y is a diagonal matrix with the eigenvalues shifted by 7, into Eqn. (|15[) 



VA+y^ - VAfV^ = (7 - j)VAfV^VA+V^ 
VA+V^ - VA+V^ = (7 - j)VA+A+V^ 
^A+-A+ = (7-7)A+A+. 

The system is decoupled so it will be sufficient to consider a single eigenvalue 

1 1 7 ~ 7 

Ai-7 Ai-7 (Ai-7)(Ai-7) 
Ai-7 Ai-7 7-7 



(A, - 7) (A, - 7) (A. - 7) (A. - 7) ( A^ - 7) (A. - 7) 

7-7 ^ 7-7 

(A,-7)(A.-7) (A,-7)(A.-7)* 

Also, this trivially holds for the rank deficient case, i.e., = 0. 



Lemma 5 As pointed out in Section \37j[ it is already immediate that the initial semi- supervised 
eigenvector can be computed using a diffusion-has ed procedure, such as the Push algorithm. How- 
ever, from that discussion it remains unclear how the approach can he generalized for the consecutive 
k — 1 semi- supervised eigenvectors. It turns out that the k^^ solution is approximated hy 

Xk^c{I-XX^DG){LG-lkDG)^DGS, (16) 

given that (Lq — ^kDG)^DGS is linearly independent with respect to the previous k — 1 solutions 
contained in X. 

Proof: By Eqn. ^ the solution for the second semi-supervised eigenvector can be expressed as 

y2 = c {P^ + - P^,+yi{yJP^,+yi)+yJP^,+) o'fs, 

where {y^P^^'^yi)^ is a constant. For notational convenience we start by substituting h — Dq s 
together with the explicit solution yi oc P^^^h 

^ qP +5 — ^72 ^11 bo Pj^ Pj^ b 

u -L^-i -L^r^ Jr^-i u 

and for the same reason we also introduce P7172 — b^ P^^^P^^^b 



7/2 = cP^^^b 



Lf -i ^-| JT ryi-. IT ry^ (J 



48 



We can approximate this expression by exploiting the structural result of Lemma [4} namely that 

Pjl^ - Pj2^ = (71 - 72)-P72 + -P7i + 



y2 ~ cP^2 b 



cP^^^b ■ 



^^7172 1-' 71 -'72 /" 

bTP,, + {P,,+ -P,, + )b 

P717I ~ P7172 



We emphasize that this approximation is exact whenever P^^^ — P^2^ is well-conditioned, and 
singular for 71 =72. Then, substitute c = ^^itl ^7172 

^7171 

^7171^72 ^ ~ ^7172^72 ^ P7i72V^7i ~ ^72 J^ 
^2 ^ 

P7171 P7171 

^7171^72 ^ ~ ^7172^72 ^ ~ ^7172^71 ^ ' ^7172^72 ^ 

P7171 
^7171^72 ^ ~ ^7172^71 ^ 



= ^72 + ^ 



P7171 
P71 72^71 



P7171 

By resubstituting for the auxiliary variables we obtain the desired result 
and by applying this procedure recursively it follows that 

Finally, we can relate this result to the combinatorial graph Laplacian as follows 

Vk ~ c{I - D]i''xx'^D]i'')D]l\LG - -/kDoVDcs 
= ciD^^ - d'^XX^DgXIg - jkDG)+DGS 
= cD]1\i - XX^Dg){Lg - jkDG)+DGS, 

— 1/2 

and due to the relationship Xk = Dq y^ it follows that 

Xk « c{I - XX^DgXLg - jkDG)+DGS. 
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B Derivation of sparse graph diffusions. 



To allow efficient computation of semi-supervised eigenvectors by graph diffusions, we must make 
the relationship with the sparse seed vector explicit. Here we specifically consider the derivation of 
Eqn. (113). Given a sparse seed indicator 5o, we can write the seed vector 5 as 5 oc Dq ^ {I—vov^)so^ 
where v^ oc diag(i?^/^) is the leading eigenvector of the normalized graph Laplacian (corresponding 
to the all-one vector of the combinatorial graph Laplacian). Using this explicit form of s we can 
rewrite the leading solution as 

xi = c{Lg - ^Dg)^Dgs 

= cD-J'\Cg - ^iyD'fD-'/\l - vov^)so 
= cD-^/^ {{Cg - -fl^so - {Cg - -flVvov^so) . 

Since Cg — ")I simply shifts the eigenvalues of Cg by —7, the latter term simplifies to 
xi = cD^""'^ { {Cg - jl^so - i -^vqVq j vqVqSo j 
= cD^^^'^ I {Cg - 7!)^ So + -vqVqSq j 

r^-1/2 ( 1 T^-l/2 ( 7 nV2 A 1 T 



Finally, by exploiting the peeling result in Eqn. (16), we can use the Push algorithm to approximate 



the sequence of semi-supervised eigenvectors in an extremely efficient manner 
as the Push algorithm is only applied on the sparse seed set. 

C Nystrom Approximation for the Normalized Graph Laplacian 

The vanilla procedure is as follows; we choose m samples at random from the full data set, and for 
notational simplicity we reorder the samples so that these m samples are followed by the remaining 
n — N — m samples, i.e., we can partition the adjacency matrix as 



^G^ \ tdT 



A B 
B^ C 



where A G R^x^, B G R^><^, and C G M^""^, with N = m + n and m < n. The Nystrom extension 
then approximates the huge C matrix in terms of A and S, so the resulting approximation to weight 
matrix becomes 



Ag^Ag^{^t b^A-'b) 
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Hence, rather than encoding only each nodes k-nearest-neighbors into the weight matrix, the 
Nystrom methods provides a low-rank approximation to the entire dense weight matrix. Since 
the leading eigenvectors of Dq ' AqDq ' correspond to the smallest of Lq^ our goal is to diago- 
nalize Dq ' AqDq ' . At the risk of washing out the local hetrogeneties the Nystrom procedure 

— 1/2 —1/2 ~ "' 

approximates the largest eigenvectors of Dq ' AqDq ' using the normalized matrices A and B 

Aij^—=A=, z,j = l,...m 

Bij^^=^^=, i = l,...m, j = l,...,n. 

y didjj^m 

Finally, let UAU^ be the SVD of A + Jl^^/^S^-^Jl"^/^, then the m leading eigenvectors are ap- 
proximated by 

V^( ^^\ i-V2f/A-V2^ 
and the normalized graph Laplacian by Cq ~ -^ — VAV^. 
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